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Abstract: We propose a general and formal statistical framework for multiple 
tests of association between known fixed features of a genome and unknown 
parameters of the distribution of variable features of this genome in a popu- 
lation of interest. The known gene-annotation profiles, corresponding to the 
fixed features of the genome, may concern Gene Ontology (GO) annotation, 
pathway membership, regulation by particular transcription factors, nucleotide 
sequences, or protein sequences. The unknown gcnc-parameter profiles, corre- 
sponding to the variable features of the genome, may be, for example, re- 
gression coefficients relating possibly censored biological and clinical outcomes 
to genome-wide transcript levels, DNA copy numbers, and other covariates. A 
generic question of great interest in current genomic research regards the detec- 
tion of associations between biological annotation metadata and genome-wide 
expression measures. This biological question may be translated as the test of 
multiple hypotheses concerning association measures between gene-annotation 
profiles and gene-parameter profiles. A general and rigorous formulation of the 
statistical inference question allows us to apply the multiple hypothesis test- 
ing methodology developed in [Multiple Testing Procedures with Applications 
to Genomics (2008) Springer, New York] and related articles, to control a 
broad class of Type I error rates, defined as generalized tail probabilities and 
expected values for arbitrary functions of the numbers of Type I errors and 
rejected hypotheses. The resampling-based single-step and stepwise multiple 
testing procedures of [Multiple Testing Procedures with Applications to Ge- 
nomics (2008) Springer, New York] take into account the joint distribution of 
the test statistics and provide Type I error control in testing problems involv- 
ing general data generating distributions (with arbitrary dependence structures 
among variables), null hypotheses, and test statistics. 

The proposed statistical and computational methods are illustrated using 
the acute lymphoblastic leukemia (ALL) microarray dataset of [Blood 103 
(2004) 2771-2778], with the aim of relating GO annotation to differential gene 
expression between B-cell ALL with the BCR/ABL fusion and cytogenetically 
normal NEG B-cell ALL. The sensitivity of the identified lists of GO terms 
to the choice of association parameter between GO annotation and differen- 
tial gene expression demonstrates the importance of translating the biological 
question in terms of suitable gene-annotation profiles, gene-parameter profiles, 
and association measures. In particular, the results reveal the limitations of 
binary gene-parameter profiles of differential expression indicators, which are 
still the norm for combined GO annotation and microarray data analyses. Pro- 
cedures based on such binary gene-parameter profiles tend to be conservative 
and lack robustness with respect to the estimator for the set of differentially 
expressed genes. Our proposed statistical framework, with general definitions 
for the gene-annotation and gene-parameter profiles, allows consideration of a 
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much broader class of inference problems, that extend beyond GO annotation 
and microarray data analysis. 
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1. Introduction 
1.1. Motivation 

Experimental data, such as microarray gene expression measures, gain much in rele- 
vance from their association with biological annotation metadata, i.e., data on data, 
such as, GenBank sequences. Gene Ontology terms, KEGG pathways, and PubMed 
abstracts. A challenging and fascinating area of research for statisticians concerns 
the development of methods for relating experimental data to the wealth of meta- 
data available publicly on the WWW. Tasks include accessing and pre-processing 
the data, making inference from these data, and summarizing and interpreting the 
results. 

In this context, an important class of statistical problems involves testing for 
associations between known fixed features of a genome and unknown parameters of 
the distribution of variable features of this genome in a population of interest. Here, 
features of a genome are said to be fixed, if they remain constant among population 
units. In contrast, variable features are allowed to differ among population units. 
Fixed features typically consist of gene annotation metadata, that refiect current 
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knowledge on gene properties, such as, nucleotide and protein sequences, regula- 
tion, and function. Variable features often consist of gene expression measures, 
that reflect cellular type/state/activity under particular conditions. The fixed and 
variable features define, respectively, gene-annotation profiles and gene-parameter 
profiles; the parameter of interest then corresponds to measures of association be- 
tween known gene- annotation profiles and unknown gene-parameter profiles. 

For instance, for the yeast Saccharomyces cerevisiae (in short, S. cerevisiae), one 
may be interested in detecting associations between the vector of mean transcript 
(i.e., niRNA) levels for all (approximately 6,500) genes under heat-shock condi- 
tions and Gene Ontology (GO) annotation for these genes. The reader is referred 
to the Gene Ontology Consortium website (www.geneontology.org) and to Sec- 
tion 4, below, for more information on gene ontologies, and to the Saccharomyces 
Genome Database (SGD) website (www.yeastgenome.org), for details on S. cere- 
visiae. In this example, the population of interest may consist of all heat-shocked 
cells from well-defined cultures of a particular strain of S. cerevisiae (e.g., strain 
S288C). For each of the three gene ontologies (BP, CC, and MF, as described in 
Section 4.1), each gene is annotated with a fixed set of GO terms (i.e., this set is 
constant across population units for a given version of the GO Database). Thus, 
for a given GO term, one may define a gene-annotation profile as a known, fixed 
binary vector indicating for each gene whether it is annotated or not with the 
particular GO term. The transcript levels, however, vary among population units 
and the gene-parameter profile, i.e., the vector of genome- wide mean transcript 
levels in the population of heat-shocked yeast cells, is unknown and may be esti- 
mated, for example, from a microarray experiment involving a sample of yeast cells 
from the population. The association parameter of interest, between GO annota- 
tion and transcript levels, is then a vector of association measures (e.g., two-sample 
t-statistics) between the known binary gene- annotation profiles and the unknown 
continuous gene-parameter profile. 

Similar inference questions arise in many other contexts and involve a variety 
of definitions for the gene-annotation profiles, the gene-parameter profiles, and the 
association parameters of interest. For example, in cancer microarray studies, one 
may seek associations between GO gene-annotation profiles and a gene-parameter 
profile of regression coefficients relating (censored) patient survival data to genome- 
wide transcript levels or DNA copy numbers. Furthermore, gene- annotation profiles 
need not be binary or even polychotomous, and may correspond to pathway mem- 
bership, regulation by particular transcription factors, nucleotide sequences, and 
protein sequences. 

Note that, for the sake of illustration, we focus on gene-level features. However, 
our proposed methodology is generic and may be applied to other types of features, 
such as those concerning gene isoforms and proteins. For instance, as in alterna- 
tive splicing microarray analysis, one may collect data at the finer level of gene 
isoforms, where one gene may have multiple isoforms [10]. In this context, isoform- 
parameter profiles may refer to the distribution of microarray isoform expression 
measures in a well-defined population, while isoform- annotation profiles may con- 
sist of exon/intron counts/lengths/nucleotide distributions. One may also consider 
protein-level features, where, for example, protein-parameter profiles correspond 
to antibody microarray expression measures and protein- annotation profiles refer 
to protein function, domain structure, and post-translational modification (e.g., 
Swiss-Prot, www . expasy . org/sprot). 
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1.2. Contrast with other approaches 

Existing approaches for tests of association with biological annotation metadata 
focus primarily on relating microarray gene expression measures and GO annota- 
tion. Relevant articles and software packages include: FatiGO from the BABELOMICS 
suite ([1, 2]; www.babelomics.org); GOstat ([4]; gostat.wehi.edu.au); 
Ontologizer ([22]; www.charite.de/ch/medgen/ontologizer); CSEPCT ([2<S]; 
genomeS . ucsf . edu: 8080/cgi-bin/compareExp . cgi); GSEA-P (['-•^], [•>(>]; 
www . broad . mit . edu/ gsea/ doc/doc jLndex . html) ; [37] . 

Methods proposed thus far suffer from a number of limitations, related, to a large 
extent, to the absence of a clear and precise statement of the statistical inference 
question. As a result, the analyses often lack statistical rigor and tend to be ad hoc 
and dataset-specific. 

One of our main contributions is the systematic and precise translation of a 
general class of biological questions into a corresponding class of multiple hypothesis 
testing problems. A key step in this process is the proper definition of the gene- 
annotation profiles, gene-parameter profiles, and association parameters of interest. 
This general formulation then allows us to apply the multiple hypothesis testing 
methodology developed in [14] and related articles [8, 15, 16, 31, 32, 33, 34, 39, 
40, 41, 42], to control a broad class of Type I error rates, defined as generalized 
tail probabilities (gTP), gTP{q,g) = PT{g{Vn,Rn) > <?), and generahzed expected 
values (gEV), gEV{g) ~ 'Ei[g{Vm Rn)], for arbitrary functions g{Vn,Rn) of the 
numbers of false positives Vn and rejected hypotheses i?„. 

We wish to emphasize the crucial and often ignored distinction between: (i) defin- 
ing a parameter of interest, measuring the association between gene-annotation and 
gene-parameter profiles, i.e., the statistical formulation of the biological question; 
(ii) making inferences^ i.e., estimating and testing hypotheses concerning this pa- 
rameter, based on a sample drawn from the population under consideration. Most 
methods proposed to date focus on (ii), without providing a clear statement of the 
question being answered in (i), that is, various estimation and testing procedures 
are proposed for an undefined parameter of interest. 

Due to its general and rigorous statistical framework, our approach to multiple 
tests of association with biological annotation metadata differs in a number of 
important ways from current approaches, such as those developed for inference 
with Gene Ontology metadata and implemented in the software packages listed on 
the Gene Ontology Tools webpage (www.geneontology.org/GO.tools.shtml). 

General gene-annotation profiles. Existing approaches typically consider bi- 
nary gene- annotation profiles, e.g., vectors of indicators of GO term annotation. 
Our general definition of gene-annotation profiles allows consideration of arbi- 
trary qualitative and quantitative fixed features of a genome, e.g., membership 
of genes to any number of pathways or clusters, exon/intron counts/lengths/ 
nucleotide distributions, mean transcript levels. 

General gene-parameter profiles. Existing approaches typically consider bi- 
nary gene-parameter profiles, e.g., vectors of indicators of differential expression. 
Our general definition of gene-parameter profiles allows consideration of a much 
broader class of testing problems, concerning arbitrary qualitative and quanti- 
tative parameters, such as, differences in mean expression levels or regression 
coefficients relating expression levels to clinical outcomes. 

Estimated gene-parameter profiles. Existing approaches typically assume 
known gene-parameter profiles. For example, the list of differentially expressed 
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genes from a microarray experiment is usually treated as known and fixed in 
subsequent analyses with GO, while in fact it corresponds to an unknown and 
estimated parameter. Distinguishing between the definition of a parameter and 
inference concerning this parameter, as in Section 3, provides a more rigorous 
and general formulation of the statistical question. 
General tests of association. Common approaches to tests of association with 
GO annotation arc typically limited to tests of independence in 2 x 2 contingency 
tables (e.g., based on the hypcrgcomctric distribution, Fisher's exact test). As 
in Table 2, rows correspond to gene annotation with a given GO term (fixed 
binary gene-annotation profile) and columns to a gene property of interest, such 
as differential expression (treated as a fixed binary gene-parameter profile) . The 
approach proposed in Section 3 allows consideration of a broader class of biologi- 
cal testing problems, while properly accounting for the fact that gene-parameter 
profiles are usually unknown and replaced by a random (i.e., data-driven) esti- 
mator. 

1.3. Outline 

This article proposes a general and formal statistical framework for multiple tests 
of association with biological annotation metadata, using the multiple hypothesis 
testing methodology developed in [14] and related articles. 

Section 2 provides an introduction to multiple hypothesis testing. Section 3 
presents the proposed statistical framework for multiple tests of association with 
biological annotation metadata and discusses in detail the main components of 
the inference problem, namely, the gene-annotation profiles, the gene-parameter 
profiles, and the association parameters. Multiple testing procedures (MTP) for 
tests of association between gene-annotation profiles and gene-parameter profiles 
are outlined. Section 4 gives an overview of the Gene Ontology (GO) and R soft- 
ware for accessing and analyzing GO annotation metadata (e.g., for assembling GO 
gene-annotation profiles). The proposed statistical and computational methods are 
illustrated in Section 5, using the acute lymphoblastic leukemia (ALL) microarray 
dataset of [f3], with the aim of relating GO annotation to differential gene expres- 
sion between B-cell ALL with the BCR/ABL fusion and cytogenetically normal 
NEG B-cell ALL. Finally, Section 6 summarizes our findings and outlines ongoing 
work. 

2. Multiple hypothesis testing 

This section, based on Chapter 1 of [f4], introduces a general statistical framework 
for multiple hypothesis testing and summarizes in turn the main ingredients of a 
multiple testing problem, including: the data generating distribution; the parame- 
ters of interest; the null and alternative hypotheses; the test statistics; the rejection 
regions (i.e., cut-offs) for the test statistics; errors in multiple hypothesis testing; 
Type I error rate and power; adjusted p-values. 

The section also provides an overview of multiple testing procedures developed 
in [14, Chapters 2-7] , for controlling generalized tail probability and expected value 
error rates, including the key choices of a joint null distribution and rejection regions 
for the test statistics. 

The reader is referred to our book and articles for further detail on the multiple 
testing methodology, its software implementation, and its application to a variety 
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of testing problems in biomedical and genomic research [5, 6, 7, 8, 9, 14, 15, 16, 26, 
31, 32, 33, 34, 39, 40, 41, 42]. 

2.1. Null and alternative hypotheses 

Hypothesis testing is concerned with using observed data to make decisions regard- 
ing properties of (i.e., hypotheses for) the unknown data generating distribution. 

Let X„ = {Xi : i = 1, . . . ,n} denote a random sample of n independent and 
identically distributed (IID) random variables from a data generating distribution 

P, i.e., Xi P, i = 1, . . . , n. Suppose that the data generating distribution P is 
an element of a particular statistical model A4, i.e., a set of possibly non-parametric 
distributions. Let P„ denote the empirical distribution corresponding to the sample 
Xn, i.e., the distribution which places probability 1/n on each realization of X. 

In order to cover a broad class of testing problems, define M pairs of null and 
alternative hypotheses in terms of a collection of M submodels, A4{m) Q M, m ~ 
1, . . . , A/, for the data generating distribution P. Specifically, the M null hypotheses 
and corresponding alternative hypotheses are defined as 

(1) Haim) = 1{P e Mim)) and i/i(m) = I (P ^ 7W(m)) , 
respectively. 

The general submodel representation accommodates tests of means, quantiles, 
covariances, correlation coefficients, and regression coefficients in linear and non- 
linear models (e.g., logistic, survival, time-series models). 

In many testing problems, the submodels concern parameters, i.e., functions 
\E'(P) =•(/) = {il}{m) : m = 1, . . . , M) e MJ^^ of the data generating distribution P, 
and each null hypothesis Ho{m) refers to a single parameter, ip{m) = 5'(P)(m) G 
M. One distinguishes between two types of testing problems for such parametric 
hypotheses: one-sided tests, 

(2) Ho{m) = l{^P{m)<Mm)) 

vs. Hi{m) = l{'i(j{m) > ipo{m)) , m=l,...,M, 

and two-sided tests, 

(3) Fo(m) = I(V'(m)=V'o(m)) 

vs. Hi{m) = l{'i(j{m) ^ tpoim)) , m = l,...,M. 

The hypothesized null values, ipQ{m), are frequently zero. 
Let 

(4) Ho = HoiP) = {m : i7o(m) = 1} = {m : P e Mim)} 

denote the set of /iq = |Ho| true null hypotheses, where the longer notation 'Hq{P) 
emphasizes the dependence of this set on the data generating distribution P. Like- 
wise, let 

(5) Hi ^ HiiP) = {m : Hi{m) = 1} = {m : P ^ M{m)} = Hg(P) 

be the set of hi = |7ii \ — M — ho false null hypotheses. 

The goal of a multiple testing procedure is to accurately estimate, i.e., reject, 
the set Hi, while probabilistically controlling false positives. 
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2.2. Test statistics 

A testing procedure is a random or data-driven rule for estimating the set of false 
null hypotheses TLi = {m : Ho{m) = 0} = {m : P ^ Ai{m)}, i.e., for deciding which 
null hypotheses should be rejected. 

The decisions to reject or not the null hypotheses are based on an M-vector of test 
statistics, Tn = {Tn{m) : m = 1, . . . , A/), that arc functions T„(m) = T{m; Xn) = 
T{m; Pn) of the data Xn, i.e., of the empirical distribution P„. Denote the typically 
unknown (finite sample) joint distribution of the test statistics T„ by Q„ ~ Qn{P)- 

As in [14, Chapter 1], for the test of single-parameter null hypotheses of the 
form Ho(rn) = I {ipi'm) < ipo{m)) or Ho{m) = I {ip{m) = ipQ{m)), m = 1, . . . , M, 
consider two main types of test statistics, difference statistics, 

(6) Tn{ni) = Estimator — Null value = ^/^[ijjnijn) — ipo{ni)), 
and t-statistics (i.e., standardized differences). 

Estimator — Null value ^^ni'm) — "fpoii^) 

(7) T„(m) = c.....^„„^ = V^- 



Standard error <Jn{m) 

Here, ^{Pn) ~ ipn = {4'n{m) : m = 1, . . . , M) denotes an estimator for the param- 
eter ^{P) = -ip = {ip(m) : m ^ 1, . . . , M) and {anim)/y/n : m = I,. ..,M) denote 
the estimated standard errors for elements ipnim) of ipn- 

Further suppose an asymptotically linear estimator ipn of the parameter tp, 
with M-dimensional vector influence curve (IC) IC{X\P) = {IC{X\P){m) : m = 
1, . . . , M), such that 

n 

(8) ^„(m)-V'(™) = - V/C(X,|P)(m) + op(l/V^) 

n ^-^ 

1=1 

and E[IC{X\P){m)] = 0, for each m = 1,...,M. Let S(P) = a = {a{m,m') : 
m, m' ~ 1, . . . , M) denote the Al x M covariance matrix of the vector influence 
curve /C(X|P), where cr(TO, to') ee E[/C(X|P)(m)/C(A:|P)(TO')] and we may adopt 
the shorter notation a'^{m) ~ a{m,m) = 'E[IC'^{X\P){m)] for variances. Assume 
that cr^j(m) arc consistent estimators of the IC variances a'^{m). 

The influence curve of a given estimator can be derived as its mean-zero-centered 
functional derivative (as a function of the empirical distribution P„ for the entire 
sample of size n), applied to the empirical distribution for a sample of size one 
[19, 20]. 

This general representation for the test statistics covers standard one-sample and 
two-sample t-statistics for testing hypotheses concerning mean parameters, but also 
test statistics for correlation coefficients and regression coefficients in linear and non- 
linear models. Test statistics for other types of null hypotheses include P-statistics, 



2.3. Rejection regions 

A multiple testing procedure (MTP) provides rejection regions Cn(rn), i.e., sets of 
values for each test statistic T„(to) that lead to the decision to reject the corre- 
sponding null hypothesis Ho{m) and declare that P ^ A4{m), m = 1, . . . ,M. In 
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other words, a MTP produces a random (i.e., data-dependent) set TZn of rejected 
hypotheses that estimates the set Hi of false nuU hypotheses, 

(9) TZn = TZ{T„, Qan,a) = {m : Tn{m) G Cn(m)} ^ {m : Ho{m) is rejected} , 

where C„(m) = C(m; T„, Qom a), m = 1, . . . , Af , denote possibly random rejection 
regions. 

The long notation TZ{Tn, Qom ct) and C(m; T„, Qon, a) emphasizes that the MTP 
depends on the following three ingredients: 

• the data, Xn ~ {Xi : i = l,...,n}, through the A/-vector of test statistics, 
T„ ^ {Tn{m) : TO 1, . . . , M) (Section 2.2); 

• an (estimated) M-variate test statistics null distribution, Qom which replaces the 
unknown true test statistics distribution Qn = Qn (P) for the purpose of deriving 
rejection regions, confidence regions, and adjusted p- values (Section 2.7); 

• the nominal Type I error level a, i.e., a user-supplied upper bound for a suitably 
defined Type I error rate (Section 2.5). 

Given a proper test statistics null distribution Qo (or estimator thereof, Qon), 
the main task is to specify rejection regions for each null hypothesis, so that the 
resulting procedure probabilistically controls Type I errors. We consider MTPs 
based on nested rejection regions, that is, 

(10) C{m;T„, Qon, ai) ^ C{m;Tn,Qon,a2), whenever a i < q;2- 

Rejection regions are typically defined in terms of intervals, such as, C„(to) = 
(u„(m), -l-oo), C„(m) = {~oo, l„{m)), or C„(m) = (-oo, Z„(to)) U (u„(to), -|-oo), 
where /,i(m) = /(to; T„, Qoti, ce) and Un{m) = u{m; Tn, Qom oi) are to-be-determined 
lower and upper critical values, or cut-offs, computed under the null distribution 
Qon for the test statistics r„. Two-sided rejection regions of the form C„(to) = 
(— oo, /„(m))U(it„(m), -l-oo) allow the use of asymmetric cut-offs for two-sided tests. 

Unless specified otherwise, we assume that large values of the test statistic 
T„(to) provide evidence against the corresponding null hypothesis Holm), that is, 
we consider one-sided rejection regions of the form C„(m) = (c„(m), +oo), where 
c„(m) — c{m;Tn, Q On, ct). For two-sided tests of single-parameter null hypotheses 
using difference or ^-statistics, as in Equations (6) and (7), one could take absolute 
values of the test statistics. 

Among the different approaches for defining rejection regions, we distinguish the 
following. 

Marginal vs. joint multiple testing procedures. In marginal multiple test- 
ing procedures, rejection regions are based solely on the marginal distributions 
of the test statistics (e.g., FWER-controUing single-step Bonferroni procedure 
[12]). In contrast, joint procedures take into account the dependence structure 
of the test statistics (e.g., FWER-controUing single-step maxT Procedure 1). 
Joint MTPs tend to be more powerful than marginal MTPs. 
Note that while a procedure may be marginal, proof of Type I error control by 
this MTP may require certain assumptions on the dependence structure of the 
test statistics (e.g., FWER-controlling step- up Hochberg procedure [23]). 

Single-step vs. stepv^rise multiple testing procedures. In single-step multi- 
ple testing procedures, each null hypothesis Ho(m) is tested using a rejection 
region that is independent of the results of the tests of other hypotheses and is 
not a function of the data Xn (unless these data are used to estimate the null dis- 
tribution). In contrast, in stepwise procedures, the decision to reject a particular 
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null hypothesis depends on the outcome of the tests of other hypotheses. That is, 
the (single-step) testing procedure is applied to a sequence of successively smaller 
nested random (i.e., data-dependent) subsets of null hypotheses, defined by the 
ordering of the test statistics (common-cut-off MTPs) or unadjusted p-values 
(common-quantile MTPs). The rejection regions arc therefore allowed to depend 
on the data X„ via the test statistics r„. 

Stepwise procedures are of two main types, depending on the order in which the 
null hypotheses are tested. In step-down procedures, the most significant null 
hypotheses (i.e., the null hypotheses with the largest test statistics for common- 
cut-off MTPs or smallest unadjusted p-values for common-quantile MTPs) are 
considered successively, with further tests depending on the outcome of earlier 
ones. As soon as one fails to reject a null hypothesis, no further hypotheses are 
rejected. In contrast, for step-up procedures, the least significant null hypotheses 
are considered successively, again with further tests depending on the outcome 
of earlier ones. As soon as one null hypothesis is rejected, all remaining more 
significant hypotheses are rejected. 

Stepwise MTPs tend to be more powerful than single-step MTPs. 
Common-cut-ofF vs. common-quantile multiple testing procedures. In 

common- cut- off multiple testing procedures, the same cut-off cq is used for each 
test statistic (e.g., FWER-controlling single-step and step-down maxT proce- 
dures, based on maxima of test statistics). In contrast, in common-quantile pro- 
cedures, the cut-offs are the (5o-quantilcs of the marginal null distributions of the 
test statistics (e.g., FWER-controlling single-step and step-down minP proce- 
dures, based on minima of unadjusted p- values). 

The latter p- value-based procedures place the null hypotheses on an "equal foot- 
ing", i.e., are more balanced than their common-cut-off counterparts, and may 
therefore be preferable. However, this comes at the expense of increased compu- 
tational complexity. 

2.4- Errors in multiple hypothesis testing 

In any testing problem, two types of errors can be committed. A Type I error, or 
false positive, is committed by rejecting a true null hypothesis {TZn n Ho). A Type 
11 error, or false negative, is committed by failing to reject a false null hypothesis 

The situation can be summarized as in Table 1, where the number of rejected 
null hypotheses is 

M 

(11) Rn = = ^ I (r„(m) e Cn{m)) , 

m—l 

the number of Type I errors or false positives is 

(12) K = \n„ nno\^ J2 ^ (^»(™) ^ ^"(™)) ' 

me Wo 

the number of Type II errors or false negatives is 



(13) 



[/„EE |7^^n7^l| - ^{Tn{m)iCn{m)), 

meHi 
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Table 1. Type I and Type II errors in multiple hypothesis testing. This table summarizes the 
different types of decisions and errors in multiple hypothesis testing. The number of rejected null 
hypotheses is i?„ = |7?.„|, the number of Type I errors or false positives is V„ = {TZ^ n Ti-ol, 
number of Type II errors or false negatives is Un = \T^n ^ the number of true negatives is 
Wn = \T^n ^ Woli and the number of true positives is Sn = \TZn <~[ Cells corresponding to 
errors are enclosed in boxes. 



Null hypotheses 



True, Ho 



False, Hi 



Null hypotheses 
Non-rejected, TZ'^ Rejected, TZn 



M-R„ 



Vn = ITlnnHal 



Rn 



ho 

hi 
M 



the number of true negatives is 

(14) Wn = \K n T^ol - ^ I (Tnim) i C„(m)) = M - i?„ - C/„ = - 
and the number of true positives is 

(15) Sn = |7^„ n 7^1 1 = ^ I (T„(m) € C„(m)) = i?„ - K = hi - (7„. 

meHi 

Note that Sn, Un, Vn, and Wn each depend on the unknown data generating 
distribution P through the unknown set of true null hypotheses Ha = 7io(-P)- 
Therefore, the numbers ho = |?io| and hi = \Ti.i\ = M — ho of true and false 
null hypotheses are unknown parameters (row margins of Table 1), the number of 
rejected hypotheses i?„ is an observable random variable (column margins of Table 
1), and Sn, Un, Vn, and Wn are unobservable random variables (cells of Table 1). 

Ideally, one would like to simultaneously minimize both the number of Type I 
errors and the number of Type II errors. Unfortunately, this is not feasible and one 
seeks a trade-off between the two types of errors. A standard approach is to specify 
an acceptable level a for a suitably defined Type I error rate and derive testing 
procedures (i.e., rejection regions) that aim to minimize a Type II error rate (i.e., 
maximize power) within the class of tests with Type I error level at most a. 



2.5. Type I error rate and power 



When testing multiple hypotheses, there are many possible definitions for the Type 
I error rate and power of a testing procedure. Accordingly, we define a Type I error 
rate as a parameter 0„ = 0(-FV„.fl„) of the joint distribution Fv^.Rn of numbers 
of Type I errors Vn = [Rn H Ho| and rejected hypotheses i?„ = |7^n |. Likewise, we 
define power as a parameter i9„ = 0{Fij„M.-a) of the joint distribution Fu^^r^ of 
the numbers of Type II errors [/„ = \R'^ n TLi\ and rejected hypotheses i?„ = |7?.„|. 
We focus primarily on mappings such that On G [0, 1] and i?„ G [0, 1]. 

This parametric representation covers a broad class of Type I error rates, includ- 
ing generalized tail probability (gTP) error rates. 



(16) 



gTP{q,g) = Vv{g{Vn,Rn) > q). 
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and generalized expected value (gEV) error rates, 
(17) gEV{g) = n9{Vn,Rn)], 

for arbitrary functions g{Vn,Rn) of the numbers of Type I errors Vn and rejected 
hypotheses i?„ and user-supphed bounds q. 

Generahzed tail probabiUty and expected value error rates include as special 
cases the following commonly-used Type I error rates. 

The generalized family-wise error rate (gFWER), corresponding to g{v, r) = v and 
q £ {0, . . . , {Hq — 1)}, is the probability of at least (g + 1) Type I errors, 

(18) gFWER{q) = Pr(K > g) = 1 - Fy„ [q). 

When (7 = 0, the gFWER reduces to the usual family-wise error rate (FWER), 
controlled by the classical Bonferroni procedure. 
The tail probability for the proportion of false positives (TPPFP) among the rejected 
hypotheses, corresponding to g{v,r) = v/r and q G (0, 1), is defined as 

(19) TPPFP{q) = Pr > = 1 - Fv^/B.^ (g), 

with the convention that Vn/Rn = if i?„ = 0. 
The false discovery rate (FDR), corresponding to g{v,r) = v/r, is the expected 
proportion of false positives among the rejected hypotheses. 



(20) FDR = E 



= j qdFv^/B.S<i)^ 



R„ 

again with the convention that Vn/Rn = if Rn = 0. 

Error rates Q{Fy^jji„); based on the proportion of false positives (e.g., TPPFP 
and FDR) , are especially appealing for the large-scale testing problems encountered 
in genomics, compared to error rates 0(i^y^), based on the number of false positives 
(e.g., gFWER); as they do not increase exponentially with the number AI of tested 
hypotheses. However, error rates B(iV„/i?,„) tend to be more difficult to control 
than error rates 0(i<V„), as they are based on the joint distribution of Vn and _R„, 
rather than only the marginal distribution of Vn- 



2. 6. Adjusted p-values 

As in the case of single hypothesis testing, one can report the results of a multiple 
testing procedure in terms of the following quantities: rejection regions for the test 
statistics, confidence regions for the parameters of interest, and adjusted p-values. 

Adjusted p-values, for the test of multiple hypotheses, are defined as straightfor- 
ward extensions of unadjusted p-values, for the test of individual hypotheses. Con- 
sider any multiple testing procedure TZn(cit) = TZ{Tn, Qo,a), with rejection regions 
Cn{rn; a) = C{m; T„, Qo, a). Then, one can define an M-vector of adjusted p-values, 
Pon = {Pon im) : m = 1, . . . , A/) = F(T„, Qo) = ^^(7^(T„, Qo, a) : a e [0, 1]), as 

(21) Po„(m) = inf{Q e [0, 1] : Reject i/o (to) at nominal MTP level a} 
= inf {q e [0, 1] ; m G 7^„(a)} 

= inf{Q e [0,1] : T„(m) eC„(m; a)}, m = l,...,M. 
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That is, the adjusted p- value Pon('Ti), for null hypothesis Ho{m), is the smallest 
nominal Type I error level (e.g., gFWER, TPPFP, or FDR) of the multiple hypoth- 
esis testing procedure at which one would reject Ho^m), given r„. 

As in single hypothesis tests, the smaller the adjusted p- value P()n{m), the 
stronger the evidence against the corresponding null hypothesis Hq^tti). Thus, one 
rejects Ho{m) for small adjusted p- values Ponim). 

For instance, the adjusted p- values for the classical FWER-controlling marginal 
single-step common-quantile Bonferroni procedure are Ponim) = 
min {MPo„(m), 1}. Adjusted p-values for FWER-controlling joint single-step com- 
mon-cut-off maxT Procedure 1 are given in Equation (26). 

Under the ncstcdness assumption of Equation (10), one has two equivalent rep- 
resentations for a MTP, in terms of rejection regions for the test statistics and 
in terms of adjusted p-values. Specifically, the set of rejected null hypotheses at 
multiple test nominal Type I error level a is 



Reporting the results of a MTP in terms of adjusted p-values, as opposed to 
only rejection or not of the null hypotheses, offers several advantages, including the 
following. 

• Adjusted p- values can be defined for any Type I error rate (e.g., gFWER, TPPFP, 
or FDR). 

• They reflect the strength of the evidence against each null hypothesis in terms 
of the Type I error rate for the entire MTP. 

• They are flexible summaries of a MTP, in the sense that results are supplied for 
all Type I error levels a, i.e., the level a need not be chosen ahead of time. 

• They provide convenient benchmarks to compare different MTPs, whereby smaller 
adjusted p- values indicate a less conservative procedure. 

• Plots of sorted adjusted p-values allow investigators to examine sets of rejected 
hypotheses associated with various Type I error rates (e.g., gFWER, TPPFP, or 
FDR) and nominal levels a. Such plots provide tools to decide on an appropriate 
combination of the number of rejected hypotheses and tolerable false positive 
rate for a particular experiment and available resources. 

2. 7. Test statistics null distribution 

As detailed in Chapter 2 of [14], a key feature of our proposed multiple testing 
procedures is the test statistics null distribution (rather than data generating null 
distribution) used to obtain rejection regions for the test statistics, confidence re- 
gions for the parameters of interest, and adjusted p-values. Indeed, whether testing 
single or multiple hypotheses, one needs the (joint) distribution of the test statistics 
in order to derive a procedure that probabilistically controls Type I errors. In prac- 
tice, however, the true distribution Q„ = Qn{P) of the test statistics T„ is unknown 
and replaced by a null distribution Qq. The choice of a proper null distribution is 
crucial, in order to ensure that (finite sample or asymptotic) control of the Type 
I error rate under the assumed null distribution does indeed provide the desired 
control under the true distribution. This issue is particularly relevant for large-scale 
testing problems, such as those involving biological annotation metadata, which 
concern high-dimensional multivariate distributions, with complex and unknown 
dependence structures among variables. 



(22) 
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Chapter 2 of [14] provides a general characterization of a proper test statistics 
null distribution in terms of null domination conditions for the joint distribution of 
the test statistics (T„(m) : m e Hq) for the true null hypotheses Ho (Section 2.2). 
This general characterization leads to the explicit proposal of the following two 
main types of test statistics null distributions: a null shift and scale-transformed 
test statistics null distribution, based on user-supplied upper bounds for the means 
and variances of the test statistics for the true null hypotheses (Section 2.3), and 
a null quantile-transformed test statistics null distribution, based on user-supplied 
marginal test statistics null distributions (Section 2.4). 

In practice, the test statistics null distribution Qq = Qo{P) is unknown, as it 
depends on the unknown data generating distribution P. Resampling procedures 
based on the bootstrap arc provided to conveniently obtain consistent estimators 
of the null distribution and of the corresponding test statistic cut-offs, parameter 
confidence regions, and adjusted p- values [14, Sections 2.3.2, 2.4.2, 4.4, 5]. 

As argued in [14, Chapter 2], the following two main points distinguish our 
approach from existing approaches to Type I error control and the choice of a test 
statistics null distribution (e.g., [24] and [45]). Firstly, we are only concerned with 
control of the Type I error rate under the true data generating distribution P, i.e., 
under the joint distribution Qn = Qn{P), implied by P, for the test statistics r„. 
The concepts of weak and strong control of a Type I error rate and the related 
restrictive assumption of subset pivotality are therefore irrelevant in our context 
[45, p. 9-10, 42-43]. Secondly, we propose a null distribution for the test statistics 
{Tn ~ Qo) rather than a data generating null distribution {X ^ Pq € n^^j7W(m)). 
The latter practice does not necessarily provide proper Type I error control under 
the true distribution P. Indeed, the test statistics assumed null distribution Qn {Po) 
and their true distribution Q„ (P) may have different dependence structures for the 
true null hypotheses Tio and, as a result, may violate the required null domination 
conditions for Type I error control. 

We stress the generality of our proposed test statistics null distributions: Type 
I error control does not rely on restrictive assumptions such as subset pivotality 
and holds for general data generating distributions (with arbitrary dependence 
structures among variables), null hypotheses (defined in terms of submodels for 
the data generating distribution), and test statistics (e.g., t-statistics, x^-statistics, 
F-statistics). 

2.7.1. Null shift and scale-transformed test statistics null distribution 

The first original null distribution of [IG, 33, 41], is defined as the asymptotic 
distribution Qo = Qo{P) of the M- vector Zn of null shift and scale-transformed 
test statistics, 



where Aq (m) and Tq (m) are, respectively, user-supplied upper bounds for the means 
and variances of the Ho-specific test statistics. 

In this construction, the null shift values Ao(m) are chosen so that the Ho-specific 
statistics (Z„(m) : m € Ho) are asymptotically stochastically greater than the 
original test statistics {Tn{m) : m G Ho). The resulting null distribution therefore 
satisfies the required null domination conditions for Type I error control. 



(23) 




Multiple tests of association with biological annotation metadata 



167 



In contrast, the null scale values To(m) are not needed for Type I error control. 
The purpose of tq (m) is to avoid a degenerate null distribution and infinite cut-offs 
for the false null hypotheses (m G Tli), an important property for power considera- 
tions. This scaling is needed, in particular, for i<"-statistics that have asymptotically 
infinite means and variances for non-local alternative hypotheses. 

For a broad class of testing problems, such as the test of single-parameter 
null hypotheses using t-statistics (Equation (7)), the null distribution Qo is an 
Af-variate Gaussian distribution, with mean vector zero and covariance matrix 
(T* = that is, Qo = N(0, cr*). For tests where the parameter of interest is 

the M-dimcnsional mean vector ^'(P) — ij; = E[X], the estimator ipn is simply the 
M- vector of empirical means and cr* = E*(P) = Cor[X] is the correlation matrix 
oi X ^ P, that is, Qq ~ N(0, Cor[X]). More generally, for an asymptotically linear 
estimator ^/;„, is the correlation matrix of the vector influence curve. This 

situation covers standard one-sample and two-sample t-statistics for tests of means, 
but also test statistics for correlation coefficients and regression coefficients in linear 
and non-linear models. 

For testing the equality of K population mean vectors using F -statistics, an 
_F-statistic-specific null distribution may be defined as the joint distribution of an 
M-vector of quadratic forms of Gaussian random variables. 

2.7.2. Null quantile-transformed test statistics null distribution 

The second and most recent proposal of [42] is defined as the asymptotic distribution 
Qo = Qo{P) of the M-vector Z„ of null quantile-transformed test statistics, 

(24) Z„(m) ^ qoXQtniTnim)), 

where go.m are user-supplied marginal test statistics null distributions that satisfy 
marginal null domination conditions. According to the generalized quantile-quantile 
function transformation of [4()], define Qnmi^) = '^Qn.m{z) -f (1 — A)Qn^m{z~), 
where Qn,m are the marginal distributions of the test statistics T„(m) and the 
random variable A is uniformly distributed on the interval [0, 1], independently of 
the data Xn- 

This latest proposal has the additional advantage that the marginal test statistics 
null distributions may be set to the optimal (i.e., most powerful) null distributions 
one would use in single hypothesis testing (e.g., permutation marginal null distri- 
butions, Gaussian or other parametric marginal null distributions). 

2. 8. Overview of multiple testing procedures 

Having identified a suitable test statistics null distribution Qq (or estimator thereof, 
Qon), there remains the main task of specifying rejection regions (i.e., cut-offs) for 
the test statistics, confidence regions for the parameters of interest, and adjusted 
p- values. 

As detailed in [14, Chapters 3-7], we have developed resampling-based single-step 
and stepwise multiple testing procedures for controlling a broad class of Type I er- 
ror rates, defined as generalized tail probabilities, gTP{q,g) = Y'r{g{Vn,Rn) > q), 
and generalized expected values, gEV{g) = Fi[g{Vn, Rn)], for arbitrary functions 
g{Vn,Rn) of the numbers of false positives Vn and rejected hypotheses Our 
proposed procedures take into account the joint distribution of the test statistics 
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and provide Type I error control in testing problems involving general data gen- 
erating distributions (with arbitrary dependence structures among variables), null 
hypotheses (defined in terms of submodels for the data generating distribution), 
and test statistics (e.g., ^-statistics, x^-statistics, F-statistics). 

An overview of available MTPs is provided in Chapter 3 of [14]. Core method- 
ological Chapters 4-7 discuss the following main approaches for deriving rejection 
regions. 

Joint single-step common- cut- off and common- quantile procedures for controlling 
general Type I error rates Q[FvJ)^ defined as arbitrary parameters of the dis- 
tribution of the number of Type I errors Vn (Chapter 4 in [14], [16, 33]). Er- 
ror rates of the form 0(Fv,J include the generalized family- wise error rate, 
gFWER{q) = 1 - FvSl) = Pr(14 > q), i-e., the chance of at least {q + 1) 
Type I errors. 

Joint step-down common- cut- off (maxT) and common- quantile (minP) procedures 
for controlling the family-wise error rate, FWER = gFWER{0) = 1 — iV„ (0) = 
Fr{Vn > 0) (Chapter 5 in [14], [41]). 

(Marginal/joint single-step/stepwise common-cut-off/common-quantile) augmenta- 
tion multiple testing procedures (AMTP) for controlling generalized tail probabil- 
ity error rates, based on an initial gFWER-controUing procedure 
(Chapter 6 in [14], [1-5, 40]). 

Joint resampling-based empirical Bayes procedures for controlling generalized tail 
probability error rates (Chapter 7 in [14], [39]). 

The above multiple testing procedures are implemented in the Bioconductor R 
package multtest ([14, Section 13.1]; [32]; www.bioconductor.org). 

2.9. FWER- controlling single-step common- cut- off maxT procedure 

This section focusses on control of the family-wise error rate, using the single- 
step maxT procedure, a common-cut-off procedure exploiting the joint distribution 
of the test statistics. The method is summarized below; details are given in [14, 
Chapter 4] and [1(1]. 



Procedure 1 [FWER-controlling single-step common-cut-ofF maxT 
procedure] . 

Given an M -variate test statistics null distribution Qo, the single-step common- 
cut- ojj maxT procedure is based on the distribution of the maximum test statistic, 
max„i Z{m), for the M-vector Z = [Z{m) : to = 1, . . . , M) ~ Qo- For controlling 
the FWER at nominal level a G (0, 1), the common cut-off c{Qo, a) is defined as 
the (1 — a)-quantile of the distribution o/ max,„ Z{m), that is, 



The adjusted p-value Ponim) for null hypothesis Ho^m) is the probability, under 
Qo, that max„i Z{m) exceeds the corresponding observed test statistic tn{m), that 
is. 



(25) 




(26) 




TO= 1,...,M. 
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Procedure 1 provides proper FWER control when based on either of the two nuU- 
transformed test statistics null distributions Qq introduced in Section 2.7. Consis- 
tent estimators Qon of the null distribution Qo and corresponding single-step maxT 
cut-offs and adjusted p- values may be obtained using the bootstrap, as in Procedure 
2.9, below, for the null shift and scale-transformed test statistics null distribution 
[14, Section 4.4]. 



Procedure 2 [FWER-controlling bootstrap-based single-step 
common-cut-ofF maxT procedure]. 

1. Let P* denote a bootstrap estimator of the data generating distribution P. 
For the non-parametric bootstrap, P* is simply the empirical distribution P„, 
that is, samples of size n are drawn at random, with replacement from the 
observed data Xn = {^i : ? = 1, . . . ,n}. For the model-based bootstrap, P* 
belongs to a model M for the data generating distribution P , such as a family 
of multivariate Gaussian distributions. 

2. Generate B bootstrap samples, = {X^ : i = 1, . . . , n}, b = 1, . . . , B . For 
the bth sample, the Xf , i = 1, . . . ,n, are IID according to P*. 

3. For each bootstrap sample X^, compute an M -vector of test statistics, 
T^f {■, b) ~ {T^ {m, b) : m ~ 1, . . . , M), that can be arranged in an M x B ma- 
trix, = (T^ {m, b) : m — 1, . . . , M; b — 1, . . . , B) , with rows corresponding 
to the M null hypotheses and columns to the B bootstrap samples. 

4. Gompute row means and variances of the matrix T^, to yield estimators of 
the means, E[T„(to)], and variances, Var[T„(m)], of the test statistics under 
the data generating distribution P. 

That is, compute 

(27) E[T„^(m,.)] 
Var[r„^(m,.)] 

5. Obtain an M x B matrix, = {Z^{m, b) : m = 1, . . . , M; b ^ 1, . . . , B) , 
of null shift and scale-transformed bootstrap test statistics Z^{m,b), by row- 
shifting and scaling the matrixTf^ using the bootstrap estimators o/E[r„(TO)] 
and Var[T„(TO)] and the user-supplied null values Xo{rn) and to(to). That is, 
define 

(28) 

Z^im,b) ^ ^min|l,^^-^^^| (r„^(m, 6) - E[r„^(m, •)]) + Ao(m). 

For t- statistics as in Equation (7), the null values are Ao(m) = andro^m) = 
1. 

6. The bootstrap estimator Qon of the null shift and scale-transformed null dis- 
tribution Qo is the empirical distribution of the B columns {Z:^[-,b) : b — 
1, . . . , B} of matrix . 

7. For each column b of the matrix Z^ (i.e., bootstrap sample b), compute the 
maximum statistic, max^ Z:l^(m,, b), b = 1, . . . , B. 



b=l 

^ ;^E(r„^K 6) -E[T„^(m, •)])'. 

b=l 
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For controlling the FWER at nominal level a G (0, 1), the bootstrap single- 
step maxT common cut-off c{Q om a) is the (1 — a)-quantile of the empirical 
distribution of the B maxima {max,„ Z^{in, b) : b = 1, . . . , B} , that is, 
(29) 



c(Qon, a) = inf i z e M : ^ I ( mi 

r> ^ — ^ \ m—l. 



ma^^^Z^{m,b) <z] > (1 - a) 



9. The bootstrap single-step maxT adjusted p-value pori(™) for null hypothesis 
H[){m) is the proportion of maxima {ma.x„i {m,b) : b = 1,...,B} that 
exceed the corresponding observed test statistic tn{m), that is, 



(30) Po«(to) = 7j ( max Z^(m, 5) > t„(m) ) , 

6=1 ^ ^ 



TO = 1, . . . ,M. 



3. Statistical framework for multiple tests of association with biological 
annotation metadata 

Sections 3.1-3.3 introduce the main components of our approach to multiple tests 
of association with biological annotation metadata, namely, the gene-annotation 
profiles A, the gene-parameter profiles A, and the association measures i/j = p(^. A) 
between gene- annotation and gene-parameter profiles. We stress that the choice of a 
suitable association parameter ■0 is perhaps the most important and hardest aspect 
of the inference problem, as this parameter represents the statistical translation of 
the biological question of interest. Once the association parameter -i/; is appropriately 
and precisely defined, one can rely on a variety of statistical methods to estimate and 
test hypotheses concerning this parameter. Section 3.4 describes how the multiple 
testing methodology of [14] and related articles may be used to detect associations 
between gene-annotation and gene-parameter profiles. 

Note that, for the sake of illustration, we focus on gene-level features. However, 
as mentioned in Section 1.1, the methodology is generic and may be applied to 
other types of features, such as those concerning gene isoforms and proteins. 



3.1. Gene- annotation profiles 

Gene- annotation profiles refer to features of a genome that are assumed to be known 
and constant among units in a population of interest. Such features typically consist 
of gene annotation metadata, that reflect current knowledge on gene properties, 
such as, nucleotide and protein sequences, regulation, and function. 

Specifically, let A = {A{g, m) : g — 1, . . . ,G; m = 1, . . . , M) denote a G x M 
gene- annotation matrix^ providing data on M features for G genes in an organism of 
interest. Thus, row A{g, •) = {A{g, to) : to = 1, . . . , M) denotes an M-dimensional 
gene-specific feature vector for the gth gene, g = 1, . . . , G, and column A{-, to) = 
{A{g,m) : g = 1, . . . , G) denotes a G-dimcnsional gene- annotation profile for the 
TOth feature, to = 1, ... , M. 

In many applications, the element A{g,m) is a binary indicator, coding the 
YES/NO answer to the TOth question, among a collection of M questions one may 
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ask about gene g. For example, A{g,m) could indicate whether gene g is anno- 
tated with a particular GO term to, among M terms in one of the three ontologies 
(BP, CC, or MF), i.e., whether gene g is an element of the node corresponding 
to the mth term in the GO directed acyclic graph (DAG). Other gene-annotation 
profiles of interest may refer to exon/intron counts/lcngths/nucleotide distribu- 
tions, gene pathway membership (e.g., from the Kyoto Encyclopedia of Genes and 
Genomes, KEGG, www . genome . ad . jp/kegg), or gene regulation by particular tran- 
scription factors. Regarding transcription regulation, one could use data from the 
Transcription Factor DataBase (TRANSFAG, www.gene-regulation.com) to gen- 
erate gene-annotation profiles as follows. For a given transcription factor binding 
motif, a binary gene-annotation profile could consist of indicators for the presence 
or absence of the motif in the upstream control region of each gene. A continuous 
gene-annotation profile could be based on the position weight matrix of the binding 
motif. 

Note that the aforementioned features are only fixed in time for a given ver- 
sion/release of the corresponding database(s), i.e., such biological data are con- 
stantly evolving as our knowledge of the roles of genes and proteins is accumulating 
and changing. The dynamic nature of biological annotation metadata is an impor- 
tant issue in terms of software design (Section 4.2; [IS]). 

Note also that gene-annotation profiles are not restricted to be binary or even 
polychotomous and, in particular, could be continuous gene-parameter profiles, suit- 
ably estimated from previous studies. 

The main point, regarding the formulation of the statistical inference question, 
is that gene-annotation profiles are known and constant among population units. 

3.2. Gene-parameter profiles 

Gene-parameter profiles are generally unknown and concern the distribution of 
variable features of a genome in a well-defined population. Gene-specific variables of 
interest refiect cellular type/ state/ activity under particular conditions and include 
microarray measures of transcript levels and comparative genomic hybridization 
(GGH) measures of DNA copy numbers. 

Specifically, let X ~ {X{j) : j ~ 1, . . . , J) be a J-dimensional random vector, 
containing G gene-specific random variables {X{g) : g = 1,...,G). In addition 
to the G gene-specific variables, X may include various biological and clinical co- 
variates (e.g., age, sex, treatment, timepoint) and outcomes (e.g., survival time, re- 
sponse to treatment, tumor class). Let P denote the data generating distribution for 
the random J- vector X and suppose that P belongs to a (possibly non-parametric) 
model A4. 

Let the parameter mapping A : — s- M.^ define a G-dimensional gene-parameter 
profile, A(P) = A = {X{g) : g = 1, . . . ,G) eR^ , where each X{g) = A(P)(5) e M is a 
gene-specific real-valued parameter. For example, X{g) could be the mean expression 
measure E[A((7)] of gene g or a, regression coefficient relating an outcome component 
of X to the expression measure X{g) of gene g, g = 1, . . . ,G. 

While gene-annotation profiles are known and fixed, gene-parameter profiles are 
typically unknown and need to be estimated, e.g., from a microarray experiment 
involving a sample of population units. The sample X„ = {Xi : i = 1, . . . ,n} is 
assumed to consist of n independent and identically distributed (IID) copies of 
X ^ P, corresponding to n randomly sampled population imits. 
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3.3. Association measures for gene- annotation and gene-parameter 
profiles 

Let the parameter mapping 5* : ^ M^^ specify an M-dimcnsional association 
parameter vector, 

(31) = 7/. = (V'(m) : m - 1, . . . , M) = p{A, A(P)), 

defined in terms of an association measure p : MP^^'^ x R*^ R^^, known fixed 
gene- annotation profiles A, and an unknown gene-parameter profile A = 

The choice of a suitable association parameter is subject matter-dependent and 
requires careful consideration. For instance, for Gene Ontology annotation, it is 
desirable that the association parameter reflect the structure of the GO directed 
acyclic graph (Section 4.1). In principle, the dimension of the association param- 
eter vector ij} could differ from the number A/ of features under consideration. In 
addition, one could accommodate several gene-parameter profiles A. 

The various quantities in the inference problem are summarized in Figure 1; 
examples of association parameters are given next and in Section 5. 



3.3.1. Univariate association measures 



In the simplest case, one could define the M association parameters univariately, i.e., 
define 'ip{m) based only on the mth gene-annotation profile A{-, m), m = 1, . . . , M . 



Gene-annotation profiles, A 
Known, fixed GxM matrix 



Gene-parameter profile, X 
Unknown G-vector, 
to be estimated 



G genes 




G genes 



Association parameter vector 
Unknown M-vector, to be estimated 

Fig 1 . Parameters for tests of association with biological annotation metadata. This figure rep- 
resents the main ingredients involved in multiple tests of association with biological annotation 
metadata: the gene-annotation profiles, the gene-parameter profile, and the association parame- 
ters. (Color version on website companion.) 
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Specifically, for the mth feature, let 

(32) ^Sj{P){m) = V.(m) = m), A(P)), 

where p„i : M*^ x M*^ R provides a measure of association (e.g., correlation 
coefficient) between the G-dimensional gene-annotation profile A{-,m) and gene- 
parameter profile A = A(P). In many situations, the same association measure pm 
may be used for each of the M features. 

Continuous gene-annotation profiles and continuous gene-parameter pro- 
files For continuous gene-annotation and gene-parameter profiles, one may use as 
association measure the Pearson correlation coefficient between two G- vectors. That 
is, 



(33) VM 



EL(^(ff,"^)-^M)(A(5)-A) 



EUiMg, m) - Aim))\ j:%{X{g) - A)2 



where A{m) = ^ = J^gMd)/^^ denote, respectively, the aver- 

ages of the G elements of the gene-annotation profile A{-,m) and gene-parameter 
profile A. 

Binary gene-annotation profiles and binary gene-parameter profiles For 

binary gene-annotation and gcne-paramcter profiles, one may build 2x2 contin- 
gency Table 2 and use as association measure the -statistic (or corresponding 
p- value) for the test of independence of rows and columns. That is, 

(34) ip[m) = — — — — — — , 

ffo- [m)g.o(m)g.i{m)gi. (m) 

where gkk'{ni) = J2g^iM9,m) =fc)I(A(.g) ^ k'), gk-{m) = gko{m) + gki{m) = 
Eg I (^(5,™) = /c), and g.k'{m) = gok'{m) + gik'{m) = J^g^iHg) = k'), k,k' e 
{0, 1}. Note that in this context the 'x^ -statistic ipim) is a parameter, i.e., it is 
a function of the data generating distribution P, via the gcne-paramcter profile 
A = A(P), and is therefore unknown and constant among population units. 



Table 2. Binary gene- annotation and gene-parameter profiles. Given a binary genc-annotation 
profile A(-,m) and a binary gene-parameter profile A, one may build a 2 X 2 contingency table, 
with rows corresponding to the gene-annotation profile and columns to the gene-parameter profile. 
Cell counts are defined as g^^y (m) = ^ I (A(g, m) = fc) I {\{g) = k'), k, k' G {0, 1}. For example, 
for tests of association between GO annotation and differential gene expression, gii{m) could 
correspond to the number of genes that are annotated with GO term m and differentially expressed. 



Gene-parameter profile, A 



1 

311 (m) = 
E^=i^(9.'»)A{3) 







310 (m) = 
E^Li^(5.™)(l-A{g)) 



301 (m) 



E,=i(l-^(5."i))A{3) J2,=i(^-^i9,mm-X{g)) 



300 (m) 



Ai{m) = 
Ef=i^(f.-) 

Ao{m) = 

J2%{l-A{g,m)) 



GA = E^=i^(9) G{1-X) = J2%(^-H9)) 



G 
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Binary gene-annotation profiles For binary gene-annotation profiles, one may 
consider association parameter vectors of the form 

(35) V = ^^A. 

That is. the association parameter for the mth feature is the sum, 

G G 
^{m) ^ ^(9, m)X{g) = ^ I (^(5, m) = 1) A(g), 

9=1 9=1 

of the parameters X{g) for genes g that have the property of interest, i.e., such that 
A{g,m) = 1. Such an association parameter is considered by [:!7], to relate contin- 
uous microarray differential expression gene-parameter profiles to binary pathway 
gene-annotation profiles. 

The following standardized association parameters, corresponding to association 
measures based on two-sample t-statistics, may also be considered, 

(36) V(m)^ Ai(™)-Ao(m) 



v[\\i[m) I v[\\o(m) 
Ai(m) Ao(m) 



where, for the mth feature. A]^{m) = I (A(g, m) — fc), 

\k{m) = Y,l{A{g,m) = k)\{g)/Ak{m), 
a 

and v[X]k{m) = ^^I(A(g,TO) = k) {X{g) - Xk{m)Y / {Ak{m) - 1) denote, respec- 
tively, the numbers, averages, and variances of annotated {k ~ 1) and unannotated 
{k = 0) gene-parameters X{g). 

In commonly-encountered combined GO annotation and microarray data analy- 
ses, a binary gene-parameter profile could indicate whether genes are differentially 
expressed in two populations of cells, a continuous gene-parameter profile could 
consist of coefficients for the regression of a (censored) clinical outcome on gene ex- 
pression measures, and binary gene-annotation profiles could denote whether genes 
are annotated with particular GO terms (Section 5; [1, 2, 4, 22]). 



3.3.2. Multivariate association measures 

More generally, the mth association parameter could be based on the entire gene- 
annotation matrix ^ or a subset of columns thereof, that is, ^'(P)(to) = ?/;(to) = 
Pm{A,k{P)), for an association measure pm ■ M*^^^^ x — > M. 

Association parameters of interest include: linear combinations of association 
parameters for several features, partial correlation coefficients, x^"Statistics for 
higher-dimensional contingency tables (e.g., with one dimension corresponding to a 
gene-parameter profile A and other dimensions to several gene-annotation profiles 
A(-,m)), and (contrasts of) regression coefficients of a gene-parameter profile A on 
several gene-annotation profiles A(-, m). 

In the case of Gene Ontology annotation, the association parameter ip should 
preferably refiect the structure of the GO directed acyclic graph, by taking into 
account, for instance, annotation information for ancestor (i.e., less specific) or 
offspring (i.e., more specific) terms (Section 4.1). Specifically, let V{m) denote the 



Multiple tests of association with biological annotation metadata 



175 



set of (immediate) parents of a term m. As the genes annotated by the child term 
m are subsets of the genes annotated by the parent terms V{m), then A{g^m) = 1 
imphes A{g,p) = 1 for p e V{m). 

Following the causal inference literature [38, 43], an association parameter of 
interest for GO term m is the marginal causal effect parameter, defined as 

(37) V(™) = E[E[A|^(.,m) = 1, A(., 7'(m))]] - E[E[A|A(.,m) = 0, A(., 7'(m))]], 

where A(-, V{m)) denotes the submatrix of gene-annotation profiles for parent terms 
V{m) and the expected values are defined with respect to the empirical distribution 
of {{Aig, m), Aig, Vim)), A(.g)) : g - 1, . . . , G}. 

In the special case of binary (differential expression) gene-parameter profiles, 
the so-called parent-child method of [2'2] takes into account the structure of the 
GO DAG by testing for associations between gene- annotation and gene-parameter 
profiles using hypergeometric p-valucs computed conditionally on the annotation 
status of parent terms. 

One could also consider Boolean combinations of annotation indicators for mul- 
tiple features, that is, a transformed gene-annotation matrix whose columns are 
Boolean combinations of the columns of the original gene-annotation matrix. Such 
an approach would be particularly relevant in the context of transcription regu- 
lation, where individual features correspond to single transcription factor binding 
motifs and Boolean combinations to binding modules for multiple transcription 
factors. 

3.4- Multiple hypothesis testing 

3.4- 1- Null and alternative hypotheses 

Certain biological annotation metadata analyses may involve the two-sided test of 
the M null hypotheses of no association between gene-annotation profiles A{-,m), 
m = 1, . . . , M, and a gene-parameter profile A, i.e., the test of 

(38) Hoim) = l{^{m)=Mm)) vs. iJi(m) = I (^(m) 7^ Vo(m)) ■ 
Other analyses may call for the one-sided test of 

(39) Ho{m) — I {il!{m) < ^o{m)) vs. Hi{m) — I {il;{m) > tpoi^)) ■ 

One-sided tests are appropriate if, for example, one is interested in determining 
whether differentially expressed genes are enriched regarding GO annotation. 

The M- vector ipo = {ipo{m) : m = 1, . . . , M), of null values for the association 
parameter ijj, is determined by the biological question. For example, if ■0(m) = 
p,„(A(-, m). A) is the Pearson correlation coefficient between the gene-annotation 
profile A(-, m) and the gene-parameter profile A, then one may set ipoi'm') — 0- 

Note that in many situations, the same association measure pm is used for each 
of the M features and one only has a single, common null value ^o(jn). 

3.4-2. Test statistics 

As in Section 2.2, above, and Chapter 1 of [14], consider the general situation 
where, given a random sample Xn from the data generating distribution P, one has 



176 



S. Dudoit, S. Kele§ and M. J. van der Laan 



an asymptotically linear estimator tpn = ^(^ri) of the association parameter vector 
-0 = '^{P), with A/-dimensional vector influence curve IC{X\P). Let S(P„) = Un = 
((T„(m, m') : m,m' = 1,...,A'/) denote a consistent estimator of the covariance 
matrix S(P) = a = {a{m,m') : m,m' = 1,...,M) of the vector influence curve 
IC{X\P). For example, (T„ could be a bootstrap-based estimator of the covariance 
matrix a or could be computed from an estimator ICn{X) of the influence curve 



A broad range of association parameters ip and corresponding estimators V'n 
satisfy the above conditions. In particular, suppose A„ = A(P„) is an asymptotically 
linear estimator of the gcne-paramctcr profile A = A(P), based on a random sample 
Xn from P. Let ?/'„ = p{A, A„) denote the corresponding resubstitution, or plug-in, 
estimator of the association parameter vector -0 = p{A,X). Then, if the function 
p{A, A) is differentiable with respect to A, the resubstitution estimator is also 
asymptotically linear. 

One can therefore handle tests where the gene-parameter profiles A are (functions 
of) means, variances, correlation coefficients, and regression coefficients, and where 
the association measures p are correlation coefficients, two-sample t-statistics, and 
X^-statistics. Examples are provided in Section 5, in the context of tests of associ- 
ation between differential gene expression in ALL and GO annotation. 

Each null hypothesis Ho{m) may then be tested using a (unstandardized) dif- 
ference statistic, 



where we adopt the shorter notation a^(m) = Unijn, m) for variances. 

Certain testing problems may call for other test statistics T„, such as, F-statistics, 
X^-statistics, and likelihood ratio statistics. 

Let Qn = Qn{P) denote the typically unknown (finite sample) joint distribution 
of the M- vector of test statistics T„ = (Tn{m) : m — 1, . . . ,M), under the data 
generating distribution P. 

3.4-3. Multiple testing procedures 

As mentioned in Section 2.7, above, and detailed in [14, Chapter 2], a key feature 
of our proposed multiple testing procedures is the test statistics null distribution 
Qo used in place of the unknown true test statistics distribution Qn = Qn{P), for 
the purpose of obtaining rejection regions for the test statistics, confidence regions 
for the parameters of interest, and adjusted p- values. 

Given a suitable test statistics null distribution Qo (or estimator thereof, Qon), 
the multiple testing procedures developed in [14] and related articles may be ap- 
plied to control a broad class of Type I error rates, defined as generalized tail 
probabilities, gTP{q,g) = Pr(g(14,^n) > q), and generahzed expected values, 
gEV{g) = 'Ej[g{Vn,Rn)]j for arbitrary functions g{Vn,Rn) of the numbers of false 
positives Vn and rejected hypotheses i?„ (Section 2.8). 

For the purpose of illustration, we focus, as in Section 2.9, on control of the 
family-wise error rate, using the single-step common-cut-off maxT procedure, based 
on a non-parametric bootstrap estimator of the null shift and scale-transformed test 
statistics null distribution (Procedures 1 and 2.9). 



IC{X\P). 




(41) 



(T„(m) 



Multiple tests of association with biological annotation metadata 



177 



4. The Gene Ontology 

4.I. Overview of the Gene Ontology 

The Gene Ontology (GO) Consortium (www.geneontology.org) provides ontolo- 
gies, i.e.; structured and controlled vocabularies, to describe gene products in terms 
of their associated biological processes, cellular components, and molecular func- 
tions. The ontologies specify terminologies and relationships among terms. They 
are organism-independent and can be applied even as our knowledge of the roles of 
genes and proteins is accumulating and changing. 

The GO Consortium and other organizations supply mappings between GO 
terms and genes in various organisms. 

Detailed documentation is available on the Gene Ontology Documentation web- 
page (www. geneontology . org/GO . contents . doc .html). 

4-. 1.1. The three gene ontologies: BP, GC, and MF 

The GO Consortium provides three ontologies, each consisting of a structured net- 
work of terms describing gene products. 

Biological Process (BP or P). The Biological Process ontology refers to series 
of biological events that are accomplished by one or more ordered assemblies 
of molecular functions. Examples of broad BP terms are cellular physiologi- 
cal process (GO: 0050875) and signal transduction (GO : 0007165); examples of 
more specific BP terms are pyrimidinc base metabolism (GO : 0006206) and alpha- 
glucosidc transport (00:0000017). 

Cellular Component (CC or C). The Cellular Component ontology refers to 
subcellular structures, with the proviso that the components be part of some 
larger object, which may be an anatomical structure (e.g., rough endoplasmic 
reticulum (GO : 0005791), nucleus (00:0005634)) or a gene product group (e.g., 
ribosome (00:0005840)). 

Molecular Function (MF or F). The Molecular Function ontology refers to tasks 
or activities performed by individual (or assembled complexes of) gene prod- 
ucts. Examples of broad MF terms are catalytic activity (00:0003824), trans- 
porter activity (GO : 0005215), and binding (GO : 0005488); examples of more spe- 
cific MF terms are adenylate cyclase activity (00:0004016) and Toll binding 
(00:0005121). 

A gene product may be used in one or more biological processes, may be associ- 
ated with one or more cellular components, and may have one or more molecular 
functions. 

Example 1. Gene product ABL1_HUMAN. The Homo sapiens gene product 
Splice Isoform lA of Proto- oncogene tyrosine-protein kinase ABLl {ABL1_HUM 
AN) can be described by the following terms in each of the three gene ontologies 
(AmiGO browser. Last updated: 2006-05-25, www . godatabase . org/cgi-bin/ amigo/ 
go . cgi?view=details&search_constraint=gp&session_id=6973bll39030258& 
gp=P00519). 

Biological Process: 

regulation of progression through cell cycle (GO : 0000074) ; 
S-phase-spccihc transcription in mitotic cell cycle (00:0000115); 
mismatch repair (GO : 0006298); 
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regulation of transcription, DNA-dependent (00:0006355); 

DNA damage response, signal transduction resulting in induction of apoptosis 

(GO: 0008630). 
Cellular Component: 

nucleus (GO : 0005634). 
Molecular Function: 

DNA binding (GO : 0003677); 

protcin-tyrosine kinase activity (GO : 0004713); 

protein binding (GO : 0005515). 

4-1-2. GO directed acyclic graphs 

For each of the three gene ontologies, GO terms are organized in a directed acyclic 
graph (DAG), where a directed graph has one-way edges and an acyclic graph has 
no path starting and ending at the same vertex. Each GO term is associated with a 
single vertex, or node, in the DAG. The words term, node, and vertex, may therefore 
be used interchangeably. 

For a given GO term, an ancestor refers to a less specialized term; an offspring 
refers to a more specialized term. A parent is an immediate/direct ancestor of a 
term; a child is an immediate/direct offspring of a term. A root node has no parents 
(i.e., no incoming edges); a leaf node has no children (i.e., no outgoing edges). In a 
DAG, a child may have several parents. 

GO terms must obey the so-called true path rule: if a (child) term describes a 
gene product, then all its immediate parent and more distant ancestor terms must 
also apply to the gene product. 

The DAG structure of GO terms and corresponding true path rule are germane 
to the definition of a suitable association measure between gene-annotation profiles 
and gene-parameter profiles (Section 3.3). Furthermore, as discussed in Sections 
4.2-4.5, in the context of Bioconductor annotation software, the true path rule is 
also relevant when assembling gene-annotation matrices. 

4.-1.3- GO software tools 

Many software tools have been developed to deal with GO annotation metadata. 
The Gene Ontology Tools webpage (www.geneontology.org/GO.tools.shtml) 
provides a list of consortium and non-consortium software for searching and brows- 
ing the three gene ontologies, for annotating genes and gene products using GO, 
and for combined GO and gene expression microarray data analysis. 

For instance, the AmiGO browser (www.godatabase.org) allows: searching for 
a GO term and viewing all gene products annotated with this term; searching for a 
gene product and viewing all its associated GO terms; and browsing the ontologies 
to view relationships among terms and gene products annotated with a given term. 

The QuickGO browser (www . ebi . ac . uk/ ego) , developed by the European Bioin- 
formatics Institute (EBI) , permits searches and graphical displays of the Gene On- 
tology by GO term, GO term identifier (ID), gene product, and other identifiers. 

Software packages developed as part of the Bioconductor Project are discussed 
in Sections 4.2-4.5. 

Example 2. GO term protein-tyrosine kinase activity. To get a sense of 
the information provided by the GO Consortium, consider the Molecular Func- 
tion ontology and the GO term protein-tyrosine kinase activity, with GO term ID 
GO: 0004713. 
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Go to the AmiGO browser (www.godatabase.org), enter the GO term ID 
GO: 0004713 m the Search GO box, select Exact Match, select Terms, and click 
on the Submit Query button. There are two main options for displaying infor- 
mation on a GO term: a "tree view" and a "graphical view" . Click on the small 
tree- like icon (top-left corner of the table) to display the tree view with all ancestors 
(i.e., less specific terms) of the GO term protcin-tyrosinc kinase activity. Click on 
the Graphical View button to display the portion of the MF DAG correspond- 
ing to the GO term. Additional information may be obtained by clicking on the 
hyperlinked text protein-tyrosine kinase activity. 

The GO term protcin-tyrosinc kinase activity has one (immediate) parent, pro- 
tein kinase activity (GO : 0004672), which itself has two parents, kinase activity 
(GO: 0016301) and phosphotransferase activity, alcohol group as acceptor 
(GO : 0016773) . Altogether, the term protein-tyrosine kinase activity has 7 ancestors. 
According to the true path rule, any gene annotated with the GO term protein- 
tyrosine kinase activity should also be annotated with all of its less specific ancestor 
terms. 

The portion of the MF DAG for the GO term protein-tyrosine kinase activity is 
displayed in Figure 2 using the QuickCO browser. 

4-. 1.4- GO gene-annotation matrices 

For each of the three gene ontologies, one may define a G x Af binary gene- 
annotation matrix A, indicating for each gene g whether it is annotated with each 
GO term to. 



Section 4.5 provides sample R code for assembling GO gene-annotation matrices 
using Bioconductor annotation metadata packages. 

As detailed in Section 3, detecting associations between GO annotation and other 
interesting features of a genome may be viewed as the multiple test of the null 
hypotheses of no association between gene-annotation profiles A{-,m) and a gene- 
parameter profile A = A(F). The multiple testing methodology proposed in [14] 
and related articles is well-suited to handle the complex and unknown dependence 
structure among test statistics implied by the DAG structure of GO terms. The 
methodology is summarized in Section 2 and illustrated in Section 5, for tests of 
association between differential gene expression in ALL and GO annotation. 

4-2. Overview of R and Bioconductor software for GO annotation 
metadata analysis 

As discussed in [In], the Bioconductor Project provides R packages for accessing and 
performing statistical inference with GO annotation metadata 
(www.bioconductor.org; www.r-project.org). The packages include: 

• a general annotation software package: annotate; 

• packages for graph theoretical analyses: e.g., graph, Rgraphviz; 

• a GO-specific metadata package for navigating the three GO DAGs: GO; 



(42) 




1, if gene g is annotated with GO term m, 
0, otherwise 

g= 1,...,G, TO = 1,...,M. 
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Fig 2. DAG for MF GO term 00:0004713, QuickGO. Portion of the directed acyclic graph for 
the GO term protein-tyrosinc kinase activity (00:0004713), in the Molecular Function ontology. 
This display, obtained using the EBI QuickGO browser (Last updated 2001-03-30 04:29:44.0, 
www.ebi.ac.uk/ego), shows the nodes corresponding to all (less specific) ancestors of the term 
protein-tyrosine kinase activity. (Higher-resolution color version on website companion.) 



• an Entrcz Gcne^-spccific metadata package, providing bi-directional mappings 
between Entrez Gene IDs and GO term IDs: humanLLMappings 
(www . ncbi . nlm . nih . gov/ entrez/ query . f cgi?db=gene); 



N.B. The NCBI LocusLink database has been superseded by the Entrez Gene database. 
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• various Affymctrix chip-specific metadata packages, providing bi-directional map- 
pings between Affymetrix probe^ IDs and GO term IDs: e.g., hgu95av2, hu6800 
(www . affymetrix . com); 

• a package for annotating and generating HTML reports for Affymetrix chip data: 
annaffy. 

Bioconductor metadata packages are updated regularly to reflect the evolv- 
ing nature of biological annotation metadata; it is therefore crucial to keep track 
of version numbers. For information on Bioconductor software, please consult [17] 
and the Documentation (www.bioconductor.org/docs) and Workshops 
(www.bioconductor.org/workshops) sections of the Bioconductor Project web- 
site, in addition to the standard R help facilities (e.g., help function, manuals, 
etc.). 

The remainder of this section provides sample R code demonstrating Bioconduc- 
tor software (results reported for R Release 2.2.1 and Bioconductor Release 1.7). 
In order to run through the examples, one needs to install and load the Biocon- 
ductor packages annotate, GO, and hgu95av2. The annotation metadata used in the 
examples correspond to the following package versions. 

> libraryCaimotate) 

> library (GO) 

> library (hgu95av2) 
> 

> packageDescript ion (" annotate ")$Version 
[1] "1.8.0" 

> packageDescription("GO")$Version 
[1] "1.10.0" 

> packageDescription("hgu95av2")$Version 
[11 "1.10.0" 

Accessing and analyzing annotation metadata from databases such as GenBank 
(www.ncbi.nlm.nih.gov/Genbank), GO (www.geneontology.org), and PubMed 
(www.pubmed.gov), presupposes the ability to perform the following essential book- 
keeping task: mapping between different identifiers (ID) for a given gene/probe. 
Bioconductor annotation metadata packages consist of environment objects that 
provide key-value mappings between different sets of gene/probe identifiers. 

For instance, in the annotation metadata package hgu95av2, for the Affymetrix 
chip series HG-U95Av2, the hgu95av2PMID environment provides mappings from 
Affymetrix probe IDs (keys) to PubMed IDs (values); similarly, the hgu95av2G0 
environment provides mappings from Affymetrix probe IDs (keys) to GO term IDs 
(values). 

Example 3. Affymetrix probe ID 1635_at .As of Version 1.10.0 of the hgu95av2 

package, the Affymetrix probe with ID 1635_at corresponds to the gene with 
symbol ABLl and long name v-abl Abelson murine leukemia viral oncogene 
homolog 1, located on the long arm of chromosome 9. This probe maps to one 
GenBank accession number, one Entrez Gene ID, 14 distinct GO term IDs, and 
160 distinct PubMed IDs. 

> probe <- "1635_at" 

^N.B. In the context of Affymetrix oligonucleotide chips, we use the shorter term probe to refer 
to a probe-pair- set, i.e., a collection of perfect match (PM) and mismatch (MM) probe-pairs that 
map to a particular gene. 
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> get (probe, env=hgu95av2SYMB0L) 
[1] "ABLl" 

> get (probe, env=hgu95av2GENENAME) 

[1] "v-abl Abelson murine leukemia viral oncogene bomolog 1 " 

> get (probe, env=hgu95av2MAP) 
[1] "9q34.1" 

> get (probe, env=bgu95av2ACCNUM) 
[1] "U07563" 

> get (probe, env=bgu95av2L0CUSID ) 
[1] 25 

> unique (names (get (probe, env=bgu95av2G0) ) ) 

[1] "GO: 0000074" "GO: 0000115" "GO: 0000166" "GO: 0003677" 
[5] "GO: 0004713" "00:0005515" "GO: 0005524" "GO: 0005634" 
[9] "00:0006298" "00:0006355" "00:0006468" "00:0007242" 
[13] "00:0008630" "00:0016740" 

> lengtb(get (probe, env=hgu95av2PMID) ) 
[1] 160 

The remainder of this section gives a brief overview of two main types of Bio- 
eonductor annotation metadata packages: the GO package (Section 4.3) and the 
hgu95av2 package for the Affymetrix chip series HG-U95Av2 (Section 4.4). Sec- 
tion 4.5 illustrates how these two packages may be used to assemble a GO gene- 
annotation matrix. 

4-3. The annotation metadata package GO 

The GO package provides environment objects containing key-value pairs for map- 
pings between GO term IDs, GO terms, GO term ancestors, GO term parents, GO 
term children, GO term offspring, and Entrez Gene IDs. The GOO command lists 
all environments available in the GO package. 

> GOO 

Quality control information for GO 

Date built: Created: Fri Sep 30 03:02:24 2005 

Mappings found for non-probe based rda files: 
GOALLLOCUSID found 9556 
GOBPANCESTOR found 9888 
GOBPCHILDREN found 4989 
GOBPOFFSPRING found 4989 
GOBPPARENTS found 9888 
GOCCANCESTOR found 1612 
GOCCCHILDREN found 578 
GOCCOFFSPRING found 578 
GOCCPARENTS found 1612 
G0L0CUSID2G0 found 70818 
GOLOCUSID found 8017 
GOMFANCESTOR found 7334 
GOMFCHILDREN found 1403 
GOMFOFFSPRING found 1403 
GOMFPARENTS found 7334 
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GOOBSOLETE found 1032 
GOTERM found 18834 

For information on any of the GO environments, use the help function, e.g., 
help (GOTERM) or ?GOBPPARENTS. For instance, the environment GOTERM provides 
mappings from GO term IDs (keys) to GO terms (values); the environments 
GOBPPARENTS, GOCCPARENTS, and GOMFPARENTS, provide ontology-specific mappings 
from GO term IDs (keys) to GO term parent IDs (values). The environments 
GOALLLOCUSID, GDL0CUSID2G0, and GOLOCUSID, provide mappings between GO 
term IDs and Entrez Gene IDs and are used in Section 4.5, below, to assemble 
an Entrez Gene ID-by-GO term ID gene-annotation matrix for the MF gene ontol- 
ogy- 
Example 4. GO term ID GO: 0004713. Let us use the GO package to obtain 
information on (all) ancestors, (immediate) parents, (immediate) children, and (all) 
offspring of the term corresponding to the GO term ID GO : 0004713. 

> ## List all GO IDs 

> GO ID <- IsCenv = GOTERM) 

> length (GOID) 
[1] 18834 

> GOID [1:10] 

[1] "GO: 0000001" "GO: 0000002" "00:0000003" "GO: 0000004" 
[5] "G0:0000006" "00:0000007" "00:0000009" "00:0000010" 
[9] "00:0000011" "00:0000012" 

> 

> ## Get information on GO term corresponding to GO ID 

> ## 00:0004713 

> GOID <- "00:0004713" 

> term <- get (GOID, env=GOTERM) 

> class (term) 
[1] "GOTerms" 
attrC "package") 
[1] "annotate" 

> slotJVames (term) 

[1] "GOID" "Term" "Synonym" "Secondary" 

[5] "Definition" "Ontology" 

> term 

GOID = 00:0004713 

Term = protein-tyrosine kinase activity 

Synonym = protein tyrosine kinase activity 

Definition = Catalysis of the reaction: ATP + a protein 

tyrosine = ADP + protein tyrosine phosphate . 
Ontology = MF 
> 

> ## Get GO IDs of parents 

> parents <- get (GOID ,env=GOMFPARENTS) 

> parents 

isa 

"00:0004672" 

> mget (parents , env=GOTERM) 
$"00:0004672" 

GOID = 00:0004672 
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Term = protein kinase activity 

Definition = Catalysis of the trsmsfer of a phosphate 
group, usually from ATP, to a protein substrate . 
Ontology = MF 

> 

> ## Get GO IDs of ancestors 

> ancestors <- get (GOID ,env=GOnF ANCESTOR) 

> ancestors 

[1] "all" "G0:0003674" "G0:0003824" "G0:0016740 

[51 "G0:0016772" "G0:0016773" "G0:0016301" "G0:0004672' 
> 

> ## Get GO IDs of children 

> children <- get (GOID, env=GOMFCHILDREN) 

> children 

[1] "GO: 0004714" "GO: 0004715" "GO: 0004716" 
> 



> ## Get GO IDs of offspring 

> offspring <- get (GOID, env=GOMFOFFSPRING) 

> offspring 



[1] 


"GO. 


: 0004714" 


"GO. 


: 0004715" 


"GO. 


:0004716" 


"GO. 


: 0005020 


[51 


"GO: 


■0005021" 


"GO: 


0005023" 


"GO: 


0005010" 


"GO: 


■0005011' 


[91 


"GO: 


■0005017" 


"GO: 


0005003" 


"GO: 


0005006" 


"GO: 


■0005007' 


[131 


"GO: 


: 0005008" 


"GO: 


■0005009" 


"GO: 


■0008288" 


"GO: 


■0005018 


[171 


"GO: 


: 0005019" 


"GO: 


■0005004" 


"GO: 


■0005005" 


"GO: 


■0008313 


[211 


"GO: 


: 0004718" 















As already noted in Example 2 and Figure 2, the term corresponding to the GO 
term ID GO: 0004713 is protein-tyrosine kinase activity, in the Molecular Function 
ontology. It has one (immediate) parent term, protein kinase activity. 



4-4- Affymetrix chip-specific annotation metadata packages: The 
hgu95av2 package 

The Bioconductor Project provides Affymetrix chip-specific annotation metadata 
packages for the main chip series for the human, mouse, rat, and other genomes 
(e.g., HG-U133, HG-U95, HU-6800, MG-U74, and RG-U34 series). These packages, 
built using the infrastructure package AnnBuilder, contain environment objects for 
mappings between Affymetrix probe IDs and other types of gene/probe identifiers. 

Note that analogous packages arc not supplied for two-color spotted microarrays, 
as there is no standard microarray design for this type of platform and specialized 
annotation metadata packages may have to be created for each microarray facility 
(e.g., using AnnBuilder). Once annotation metadata packages arc available to pro- 
vide mappings between different sets of gcnc/probe identifiers, the tools in annotate 
and related packages may be used in a similar manner for any type of microarray 
platform. 

Consider the hgu95av2 package, for the Affymetrix chip series IIG-U95Av2. This 
package provides the following environments. 



> ? hgu95av2 
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> hgu95av2() 

Quality control information for hgu95av2 
Date built: Created: Tue Oct 4 21:31:35 2005 

Number of probes: 12625 
Probe number missmatch: None 
Probe missmatch: None 

Mappings found for probe based rda files: 

hgu95av2ACCNUM found 12625 of 12625 

bgu95av2CHRL0C found 11673 of 12625 

hgu95av2CHR found 12145 of 12625 

hgu95av2ENZYME found 1886 of 12625 

hgu95av2GENENAME found 11418 of 12625 

hgu95av2G0 found 9942 of 12625 

bgu95av2L0CUSID found 12203 of 12625 

hgu95av2MAP found 12109 of 12625 

hgu95av20MIM found 9881 of 12625 

bgu95av2PATH found 3928 of 12625 

hgu95av2PMID found 12086 of 12625 

hgu95av2REFSEQ found 12008 of 12625 

hgu95av2SUMFUNC found of 12625 

bgu95av2SYMB0L found 12159 of 12625 

bgu95av2UNIGENE found 12118 of 12625 
Mappings found for non-probe based rda files: 

hgu95av2CHRLENGTHS found 25 

hgu95av2ENZYME2PR0BE found 643 

hgu95av2G02ALLPR0BES found 5480 

hgu95av2G02PR0BE found 3890 

hgu95av20RGANISM found 1 

hgu95av2PATH2PR0BE found 155 

hgu95av2PFAM found 10439 

hgu95av2PMID2PR0BE found 98214 

hgu95av2PR0SITE found 8249 

For information on any of these environments, use the help function, e.g., 
help(hgu95av2G0) or ?hgu95av2G0. We focus on the three environments related 
to GO: hgu95av2G0, hgu95av2G02ALLPR0BES, and hgu95av2GD2PR0BE. 

The HG-U95Av2 chip contains 12,625 probes (corresponding to the keys in the 
hgu95av2G0 environment), with the first 10 Affymetrix probe IDs hsted befow. 

> ## List all Affymetrix IDs 

> AffylD <- ls(env = hgu95av2G0) 

> length(AffylD) 
[1] 12625 

> AffyID[l:10] 

[11 "1000_at" "1001_at" "1002_f _at" "1003_s_at" "1004_at" 
[61 "1005_at" "1006_at" "1007_s_at" " 1008 jf_at" "1009_at" 
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4-.4-i- Probes-to-most specific GO terms mappings: The hgu95av2G0 environment 

The hgu95av2G0 environment provides key-value pairs for the mappings from Ajfy- 
metrix probe IDs (keys) to GO term IDs (values). Each Affymetrix probe ID is 
mapped to a list of one or more elements, where each element corresponds to a 
particular GO term and is itself a list with the following three elements. 

• "GO ID": A GO term ID corresponding to the Affymetrix probe ID (key). 

• "Evidence": A code for the evidence supporting the association of the GO term 
to the Affymetrix probe. 

• "Ontology": An abbreviation for the name of the ontology to which the GO term 
belongs: BP (Biological Process), CC (Cellular Component), or MF (Molecular 
Function) . 

Note that only the directly associated terms or most specific terms (i.e., not 
their less specific ancestor terms) a probe is annotated with are returned as values 
in ligu95av2G0. The GO package may be used to obtain more information on the 
GO term IDs, e.g., GO term, (all) ancestors, (immediate) parents, (immediate) 
children, and (all) offspring (Section 4.3). 

Example 5. GO terms directly associated v^rith Affymetrix probe ID 
1635_at. Let us obtain GO annotation information for the probe with Affymetrix 
ID 1635_at, corresponding to the ABLl gene. The code below shows that probe 
1635_at is directly annotated with 14 distinct GO terms (the same GO term ID 
may be returned multiple times with a different evidence code). As already noted 
in Example 1, one of these terms, with GO term ID GO : 0004713, is protein-tyrosine 
kinase activity, in the Molecular Function ontology. 

> probe <- "1635_at" 

> probe2G0 <- get (probe, env = hgu95av2G0) 

> length (probe2G0) 
[11 14 

> unique (names (probe2G0) ) 

[1] "G0:0000074" "G0:0000115" "G0:0000166" " GO : 0003677 " 
[5] "GO: 0004713" "GO: 0005515" "00:0005524" "GO: 0005634" 
[9] "00:0006298" "00:0006355" "00:0006468" "00:0007242" 
[13] "00:0008630" "00:0016740" 

> probe2G0[[5]] 
$GOID 

[1] "00:0004713" 

$Evidence 
[1] "TAS" 

$Ontology 
[1] "MF" 

> get(probe2G0[[5]]$G0ID, env=GOTERM) 
GOID = 00:0004713 

Term = protein-tyrosine kinase activity 

Synonym = protein tyrosine kinase activity 

Definition = Catalysis of tie reaction: ATP + a protein 

tyrosine = ADP + protein tyrosine phosphate . 
Ontology = MF 
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The hgu95av2G0 environment (and analogous environments for other chip scries) 
may be used to assemble an Affymetrix probe ID-by-GO term ID gene-annotation 
matrix, row by row. This may entail, however, a number of data processing steps. 
Firstly, only the most specific terms a probe is annotated with are returned as 
values in hgu95av2GD. One therefore needs to add all (less specific) ancestor terms 
in order to comply with the true path rule. Secondly, several probes may correspond 
to the same gene, i.e., several Affymetrix probe IDs may map to the same Entrez 
Gene ID according to the environment hgu95av2L0CUSID. Thirdly, the ligu95av2G0 
environment returns GO terms for all three gene ontologies at once. One may need 
to separate terms according to membership in the BP, CC, and MF ontologies (e.g., 
using the GOTERM environment from the GO package). 

Alternately, one may assemble an Affymetrix probe ID-by-GO term ID gene- 
annotation matrix, column by column, using the hgu95av2G02ALLPR0BES environ- 
ment described below. 



4--4-^- GO terms-to- directly annotated probes mappings: 
The hgu95av2G02PR0BE environment 

The hgu95av2G02PR0BE environment provides key-value pairs for the mappings 
from GO term IDs (keys) to Affymetrix probe IDs (values). Values are vectors of 
length one or greater depending on whether a given GO term ID is mapped to one 
or more Affymetrix probe IDs. The value names are evidence codes for the GO 
term IDs. 

Note that the probes a particular GO term is mapped to are only those associated 
directly with the GO term (vs. indirectly via its immediate children or more distant 
offspring). For a list of all probes associated directly or indirectly with a particular 
GO term, one may use the hgu95av2G02ALLPR0BES environment. 
Example 6. Affymetrix probes directly associated with GO term ID 
GO: 0004713. In the following example, 205 distinct Affymetrix probe IDs are as- 
sociated directly with the GO term protcin-tyrosinc kinase activity (GO : 0004713). 
The Affymetrix probe IDs include 1635_at, corresponding to the ABLl gene. 

> GOID <- "GO: 0004713" 

> G02Probes <- get (GOID, env = bgu95av2G02PR0BE) 

> length(unique(G02Probes)) 
[1] 205 

> G02Probes[l:10] 

<NA> <NA> <NA> <NA> <NA> 
"1635_at" "1636_g_at" "1656_s_at" "2040_s_at" "2041_i_at" 

TAS lEA lEA lEA TAS 

"39730_at" "1084_at" "35162_s_at" "1564_at" "854_at" 

> is. element (" 1635^t " , G02Probes) 
[1] TRUE 



4-.4.3. GO terms-to-all annotated probes mappings: 
The hgu95av2G02ALLPR0BES environment 

The hgu95av2G02ALLPR0BES environment provides key- value pairs for the mappings 
from GO term IDs (keys) to Affymetrix probe IDs (values). Values are vectors of 
length one or greater depending on whether a given GO term ID is mapped to one 
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or more Affymctrix probe IDs. The value names are evidence codes for the GO 
term IDs. 

Note that, m accordance with the true path rule, the probes a particular GO term 
is mapped to are associated either directly with the GO term or indirectly via any of 
its immediate children or more distant offspring. The main difference between the 
hgu95av2G02PR0BE and hgu95av2G02ALLPR0BES environments is that the former 
considers only the GO term itself, whereas the latter considers the GO term and 
any of its offspring. Thus, the Affymetrix probe IDs returned by hgu95av2G02PR0BE 
are a subset of the probe IDs returned by hgu95av2G02ALLPR0BES. 
Example 7. Affymetrix probes directly or indirectly associated with GO 
term ID GO: 0004713. In the following example, 319 distinct Affymetrix probe 
IDs (some with multiple evidence codes) are associated either directly or indirectly 
with the GO term protein-tyrosine kinase activity (GO : 0004713). This list of 319 
Affymetrix probe IDs indeed includes the list of 205 probe IDs associated directly 
with the GO term ID GO: 0004713. 

> GOID <- "GO: 0004713" 

> G02AllProbes <- get (GOID, env = hgu95av2G02ALLPRaBES) 

> length (G02AllProbes) 
[1] 370 

> length(unique(G02AllProbes) ) 
[1] 319 

> sumds . element (G02Probes ,G02AllProbes) ) 
[1] 205 

The hgu95av2G02ALLPR0BES environment immediately yields an Affymetrix 
probe ID-by-GO term ID gene-annotation matrix, column by column. However, 
as with the hgu95av2G0 environment, a number of data processing steps may be 
required, concerning, for example, uniqueness of Entrez Gene IDs and membership 
in the BP, CC, and MF ontologies. 

4-5. Assembling a GO gene-annotation matrix 

This section provides R code for assembling an Entrez Gene ID-by-GO term ID 
gene-annotation matrix A, column by column. Specifically, rows correspond to 
(unique) Entrez Gene IDs mapping to probes on the IIG-U95Av2 chip and columns 
to terms in the Molecular Function ontology mapping directly or indirectly to at 
least 10 Entrez Gene IDs for the HG-U95Av2 chip. 

In practice, it may not be desirable to build the full G x M gene-annotation 
matrix, as this matrix could potentially be very large and sparse (padded with 
zeros). Rather, we assemble a (smaller) gene-annotation list, that provides, for each 
GO term ID, a list of Entrez Gene IDs annotated with the GO term. 
Example 8. Entrez Gene ID-by-GO term ID gene-annotation matrix for 
MF ontology. 

> ## List all Affymetrix IDs for HG-U95Av2 chip 

> AffylD <- ls(env=hgu95av2G0) 

> length(AffylD) 
[1] 12625 

> 

> ## Get all unique Entrez Gene IDs for HG-U95Av2 chip 

> LLID <- as. character (unique (unlist(mget(Affy ID, 
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+ env=hgu95av2L0CUSID) ) ) ) 

> length (LLID) 
[1] 9085 

> 

> ## Get MF GO IDs 

> GOID <- ls(env=GOTERM) 

> <- unlist(lapply(mget(GOID, env=GOTERM) , 
+ function(z) z(30ntology) ) 

> table (a) 


BP CC MF 
9888 1612 7334 

> MFID <- GOID[0=="MF"] 
> 

> ## For each MF GO ID, get all Entrez Gene IDs for genes 

> ## annotated directly or indirectly with the GO term 

> allMFLLID <- mget(MFID, env=GOALLLOCUSID) 
> 

> ## For each MF GO ID, get HG-U95Av2-specific Entrez Gene IDs 

> ## for genes annotated directly or indirectly with the GO term 

> MFLLID <- lapply (allMFLLID, function Cz) intersect (z, LLID)) 

> numMFLLID <- unlist (lapply (MFLLID, length)) 

> summary (numMFLLID) 

Min. 1st Qu. Median Mean 3rd Qu. Max. 
0.000 1.000 1.000 9.539 1.000 6762.000 

> 

> ## Retain only MF GO IDs that map to at least 10 

> ## Entrez Gene IDs for the HG-U95Av2 chip 

> MFAnnotList <- MFLLID [numMFLLID > 9] 

> length (MFAnnotList) 
[1] 466 

> summary (unlist (lapply (MFAnnotList , length))) 

Min. 1st Qu. Median Mean 3rd Qu. Max. 
10.0 16.0 27.5 132.2 70.0 6762.0 

> MFAnnotList [1] 
$"G0: 0000146" 

[11 "4620" "4621" "4624" "4625" "4640" "4643" "4644" 
[8] "4646" "4647" "4650" "58498" 

> 

> ## Get Entrez Gene IDs for probes annotated with GO ID 

> ## GO: 0004713 

> is. element ( "GO : 0004713" , names (MFAnnotList) ) 
[1] TRUE 

> length(MFAnnotList ["GO : 0004713"] [[1]]) 
[1] 180 
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5. Tests of association between GO annotation and differential gene 
expression in ALL 

5.1. Acute lymphoblastic leukemia study of Chiaretti et al. [13] 

Our proposed approach to tests of association with biological annotation metadata 
is illustrated using the acute lymphoblastic leukemia (ALL) microarray dataset of 
[13] and Gene Ontology (GO) annotation metadata. 

5.1.1. Bioconductor experimental data R package ALL 

The ALL dataset is available in the Bioconductor experimental data R package 
ALL (Version 1.0.2, Bioconductor Release 1.7). The main object in this package 
is ALL, an instance of the class exprSet. The exprs slot of ALL provides a matrix 
of 12,625 microarray expression measures (Affymctrix chip series HG-U95Av2) for 
each of 128 ALL cell samples. The phenoData slot contains 21 phenotypes (i.e., 
covariates and outcomes) for each of the 128 cell samples. Phenotypes of interest 
include: ALL$BT, the type and stage of the cancer (i.e., B-cell ALL or T-cell ALL, 
of stage 1, 2, 3, or 4), and ALL$mol .biol, the molecular class of the cancer (i.e., 
BCR/ABL, NEG, ALL1/AF4, E2A/PBX1, pl5/pl6, or NUP-98). 

The expression measures have been obtained using the three-step robust multi- 
chip average (RMA) pre-processing method, implemented in the Bioconductor R 
package affy [1 1], and have been subjected to a base 2 logarithmic transformation. 

For greater detail on the ALL dataset, please consult the ALL package documen- 
tation. 

5.1.2. The BCR/ABL fusion 

A number of recent articles have investigated the prognostic relevance of the BCR/ 
ABL fusion in adult ALL of the B-cell lineage [21]. 

The BCR/ABL fusion is the molecular analogue of the Philadelphia chromo- 
some, one of the most frequent cytogenetic abnormalities in human leukemias. This 
t(9;22) translocation leads to a head-to-tail fusion of the v-abl Abelson murine 
leukemia viral oncogene homolog 1 (ABLl) from chromosome 9 with the 5' half 
of the breakpoint cluster region (BCR) on chromosome 22 (Figure 3). The ABLl 
proto-oncogene encodes a cytoplasmic and nuclear protein tyrosine kinase that has 
been implicated in processes of cell differentiation, cell division, cell adhesion, and 
stress response. Although the BCR/ABL fusion protein, encoded by sequences from 
both the ABLl and BCR genes, has been extensively studied, the function of the nor- 
mal product of the BCR gene is not clear. The BCR/ABL proto-oncogene has been 
found to be highly expressed in chronic myeloid leukemia (CML) and acute myeloid 
leukemia (AML) cells [30]. 

An interesting question is therefore the identification of genes that are differen- 
tially expressed between B-cell ALL with the BCR/ABL fusion and cytogenetically 
normal NEG B-cell ALL. 

In order to address this question, we consider gene expression measures for the 
n = 79 B-ccll ALL cell samples (ALL$BT equal to B, Bl, B2, B3, or B4), of the 
BCR/ABL or NEG molecular types (ALL$mol . biol equal to BCR/ABL or NEG). 
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Changed chromosome 9 




Fig 3. The Philadelphia chromosome and the BCR/ABL fusion. The BCR/ABL fu- 
sion is the molecular analogue of the Philadelphia chromosome. This t(9;22) transloca- 
tion leads to a head-to-tail fusion of the v-abl Abelson murine leukemia viral oncogene 
homolog 1 (ABLl) from chromosome 9 with the 5' half of the breakpoint cluster region 
(BCR) on chromosome 22. Figure obtained from the National Cancer Institute website 
(www . cancer . gov/Templates/db_alpha. aspx?CdrID=44179). (Color version on website compan- 
ion.) 



5.1.3. Gene filtering 

Many of the genes represented by the 12,625 probes are not expressed in B-ccll 
lymphocytes. Accordingly, as in [44], we only retain the 2,391 probes that meet 
the following two criteria: (i) fluorescence intensities greater than 100 (absolute 
scale) for at least 25% of the 79 cell samples; (ii) interquartile range (IQR) of the 
fluorescence intensities for the 79 cell samples greater than 0.5 (log base 2 scale). 

Furthermore, different probes may correspond to the same gene, i.e., map to the 
same Entrez Gene ID, according to the environment hgu95av2L0CUSID from the 
hgu95av2 package. In order to obtain one expression measure per gene, we choose 
to average the expression measures of multiple probes mapping to the same gene. 

These various pre-processing steps lead to G = 2, 071 genes with unique Entrez 
Gene IDs. 

5.1.4- Reduced ALL dataset: Genotypes and phenotypes of interest 

The combined genotypic and phenotypic data for the n — 79 B-cell ALL cell 
samples of the BCR/ABL and NEG molecular types may be summarized by the 
set = {{Xi,Yi) : i = l,...,n}, of n pairs of G-dimensional gene expres- 

sion profllcs Xi = (Xi(g) : g = 1,...,G), G = 2,071, and cancer class labels 
Y, e {NEG, BCR/ABL}. Among the n = 79 B-cell ALL cell samples, there 
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are Ubcb/abl = J2i^i^i ^ BCR/ABL) = 37 BCR/ABL samples and rir^EG = 
I (V, = NEG) = 42 NEG samples. 

5.2. Multiple hypothesis testing framework 

Our primary question of interest is the identification of genes that arc differentially 
expressed (DE) between BCR/ABL and NEG B-cell ALL. A subsequent question 
involves relating differential gene expression to GO annotation. 

As detailed below. GO annotation metadata for the filtered list of G = 2, 071 
unique genes from the HG-U95Av2 chip may be summarized by binary gene- 
annotation profiles. 

The gene-parameter profiles of interest concern differential gene expression be- 
tween BCR/ABL and NEG B-cell ALL, i.e., the association between microarray 
gene expression measures and cancer class. Continuous gene-parameter profiles of 
unstandardized and standardized measures of differential expression are estimated, 
respectively, by (unstandardized) differences of empirical means and (standardized) 
two-sample i-statistics. Binary gene-parameter profiles, indicating whether genes 
are differentially expressed, are estimated by imposing cut-off rules on two-sample 
t-statistics or adjusted p- values. 

The following association measures between GO gene-annotation profiles and 
DE gene-parameter profiles are considered: two-sample t-statistics for tests of as- 
sociation between binary GO gene- annotation profiles and continuous DE gene- 
parameter profiles; x^-statistics for tests of association between binary GO gene- 
annotation profiles and binary DE gene-parameter profiles. 

Significant associations between differential gene expression and GO annotation 
are identified by applying FWER-controUing single-step maxT Procedure 1, based 
on the non-parametric bootstrap null shift and scale-transformed test statistics null 
distribution of Procedure 2. 

5.2.1. Gene- annotation profiles 

Gene Ontology annotation metadata for the HG-U95Av2 chip series are obtained as 
described in Sections 4.2-4.5, from the following Bioconductor R packages: the GO- 
specific metadata package GO (Version 1.10.0, Bioconductor Release 1.7) and the 
Affymetrix chip-specific metadata package hgu95av2 (Version 1.10.0, Bioconductor 
Release 1.7). 

For each of the three gene ontologies, binary gene- annotation matrices Abp, 
Acc, and Amf, are assembled for the GO terms annotating at least 10 of the 
G = 2,071 filtered genes (sample R code provided in Section 4.5). Specifically, for 
gene ontology o G {BP, CC, MF}, Ao = (Ao(.g, to) : ,g = 1, . . . , G; to = 1, . . . , Mo) 
is a GxMo matrix, with element Ao{g, to) indicating whether gene g is annotated by 
GO term to and such that Ao{g, to) > 10 for each term to. The numbers of terms 
considered in each gene ontology are Mbp = 367, Mcc ~ 81, and Mmf = 185. 

5.2.2. Gene-parameter profiles 

Definition of gene-parameter profiles Consider a data structure {X, Y) ^ P, 
where X — (X(g) : 5 = 1, . . . , G) is a G = 2, 071-dimensional vector of mi- 
croarray gene expression measures and Y E {NEG, BCR/ABL} is a binary can- 
cer class label. Let ijk = Pr{Y ~ k) denote the proportion of cancers of class 
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k e {NEG, BCR/ABL}. Define conditional G-dimensional mean vectors and GxG 
covariance matrices for the expression measures of class k g {NEG, BCR/ABL} 
cancers by 

(43) fik = ^X\Y ^ k] and ak = Cov[X\Y = k], 
respectively. 

Gene-parameter profiles, concerning differential gene expression between BCR/ 
ABL and NEG B-cell ALL, may be specified in various ways. Continuous DE genc- 
parameter profiles may be defined in terms of the following unstandardized and 
standardized measures of differential gene expression between BCR/ABL and NEG 
B-cell ALL, 

(44) X'^{g) = I-Ibcr/abl{9) - fJ-NEoig) 
and 

\t/ \ _ l-^BCR/ABL 

^ yd) = I , N , 

/ crBCR/ABL(g,g) _|_ CNEGig.a) 

V Vbcr/abl Vneg 

Absolute values of X^ig) and A* (5) may be used for measuring two-sided differential 
expression, i.e., either over- or under-expression in BCR/ABL compared to NEG 
B-cell ALL. 

Binary DE gene-parameter profiles may be defined in terms of indicators for 
two-sided and one-sided differential expression. 

(45) X^{g) = l{^^BCR/ABL{g) ^ ^^NEoig)) 

= I(A'^(.9)^0) =I(A*(.9)^0), 

A+(.g) = l{lJ.BCR/ABL{g) > y^NEoig)) 

= l(A'^(g)>0)=l(A*(3)>0), 
\~{g) = l{iJ.BCR/ABL{g) < y-NEcig)) 

= l{\\g)<Q)^l{\\g)<Q). 



Estimation of gene-parameter profiles The above DE gene-parameter pro- 
files may be estimated as follows, based on the sample of gene expression 
measures for the n = 79 B-cell ALL cell samples of the BCR/ABL and NEG 
molecular types. 

The resubstitution estimators of the continuous gene-parameter profiles of Equa- 
tion (44) are based, respectively, on differences of empirical means and two-sample 
Welch i-statistics (up to the multiplier l/^/n). That is, 

(46) Xtig) = t^BCR/ABL.nig) - fJ.NEG.n{g) 

and 

1 f'BCR/ABL,n{g) ~ tJ-NEG,n{g) 

\/n / <^BCR/ABL.n{g-g) I <yNEG.n(g,g) ' 

V ^BCR/ABL njVEG 

where /^fc,„(.g) = Y^i^O^i ^ ^) Xi{g)/nk and (Jk,n{g,g) = J2i^(Xi = iXi{g) - 
l^k,nig))'^ / {nk ~ 1) denote, respectively, the empirical means and variances of the 
gene expression measures for cancers of class k e {NEC, BCR/ABL}. 



A^(.9) ^ 
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Estimating the two-sided binary gene-parameter profile of Equation (45) 
involves the two-sided test of the G null hypotheses Holg) ~ ^{^^BCR/ABL{g) = 
fJ-NEcig)), of no differences in mean gene expression measures between BCR/ABL 
and NEG B-cell ALL. Likewise, estimating the one-sided binary gene-parameter 
profiles A"*" and A~ involves, respectively, the one-sided test of the G null hy- 
potheses of no over-expression {Ho{g) = I {^iBCR/ABhig) < Mjveg(<7))) and no under- 
expression {Ho{g) = l{nBCR/ABL{g) > MnegIs))) in BCR/ABL compared to NEG 
B-cell ALL. For single-step common-cut-off maxT Procedure 1, adjusted p- values 
produce the same gene ranking as the test statistics defined in Equation (46). Sim- 
ple and naive estimators of the three sets of differentially expressed genes (i.e., 
false null hypotheses), represented by the gene-parameter profiles A^, A"'', and A~, 
are therefore given, respectively, by the sets of genes with the largest 7G values of 



Analogous estimators may also be based on other test statistics, such as unstan- 
dardized difference statistics A^. 

More sophisticated estimators, that translate the proportion 7 of rejected null 
hypotheses into a Type I error rate such as the gFWER, TPPFP, or FDR, could 
be based on adjusted p- values for the multiple test of the G null hypotheses Ho{g). 
For example, one could estimate A^ by 



where PQ^ig) are adjusted p- values for a suitably chosen multiple testing procedure, 
such as, FWER-controUing single-step maxT Procedure 1 or a TPPFP-controUing 
augmentation multiple testing procedure (Chapter 6 in [14], [40]). One-sided binary 
gene-parameter profiles A^ and A~ could be estimated likewise. 

5.2.3. Association measures for gene- annotation and gene-parameter profiles 

The association between continuous DE gene-parameter profiles, as in Equation 
(44), and binary GO gene- annotation profiles may be measured by two-sample 
Welch t-statistics (or corresponding values) . Specifically, given a continuous G- 
vector X and a binary G-vcctor y, define the following association measure. 



\K{g)l K{g), and -A^ 



[g). That is. 



(47) 




(48) 




(49) 



Xi - Xq 




where yfe = Eg I ivig) = k), Xk =Y.g^ ivig) = ^) x{g)/Vk. and 
v[x]k = Eg I {y{g) = k) {x{g) - xuf/iVk - 1), fc e {0, 1}. 
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The association between binary DE gene-parameter profiles, as in Equation (45), 
and binary GO gene-annotation profiles may be measured by -statistics (or cor- 
responding p-values) for the test of independence of rows and columns in a 2 x 2 
contingency table, such as Table 2. Specifically, given binary G-vectors x and y, 
define the following association measure, 

(50) pH:c,y)^ G(googn - goigio)^ 

(.900 +5oi)(5oo +5io)(5ii +S'oi)(5ii +5io) 
where gkk' = Eg I i^id) - k) I {y{g) = k'), fc, k' G {0, 1}. 

Given an association measure'^ p : R'-^^^^ x M.^ R^^, a G x M GO gene- 
annotation matrix A, and a G-dimensional DE gene-parameter profile A = A(P), 
the M-dimensional association parameter vector Tp = "^{P) of primary interest is 
defined as 

(51) y^ = p{A,X). 

The corresponding resubstitution estimator tpn = '^{Pn) is simply obtained by 
replacing the gene-parameter profile A by an estimator thereof A„ = A(P„), that is, 

(52) Vn = p(AA„). 



5.2.4- Null and alternative hypotheses 

For the i-statistic-based association measure p* of Equation (49), the identifica- 
tion of GO terms m that are significantly (positively or negatively) associated with 
BCR/ABL vs. NEG differential gene expression involves the two-sided test of the 
M null hypotheses Ha{m) = l(ip{m) = ipo{m)) against the alternative hypothe- 
ses Hi{m) = 1(1/1(771,) 7^ ■(/'o(to)), with null values ijjoijn) = 0. In some contexts, 
one may be interested in identifying positive (negative) associations, i.e., in the 
one-sided test of the M null hypotheses Ho{m) = l{ip{m) < ipQ^m)) (HQ^m) = 
l{ip{m) > ipo{m))) against the alternative hypotheses Hi{m) = I('(/'(to) > tpo{m)) 
{Hi\m) = I(V'(m) < ipoim))). 

For the x^-statistic-based association measure of Equation (50), the identi- 
fication of GO terms m that are significantly (positively or negatively) associated 
with BCR/ABL vs. NEG differential gene expression involves the one-sided test of 
the M null hypotheses Ho{m) = I {ip{m) < ipo{m)) against the alternative hypothe- 
ses Hi(m) = I {^{m) > ■^/'o("^))• A natural choice for the null values is the mean of 
the x^(l)-distribution, 4'o{m) ~ 1. 



5.2.5. Test statistics 



One-sided and two-sided tests of null hypotheses concerning any of the association 
parameters defined above may be based on (unstandardized) difference statistics 
Tn{m), defined as in Equation (40). 

For one-sided tests, large values of the test statistics Tn{m) provide evidence 
against the corresponding null hypotheses Ho^m), that is, rejection regions are of 
the form C„(m) = (c„(m), -t-00). For two-sided tests, large values of the absolute 
test statistics |r„(m)| provide evidence against the corresponding null hypotheses 
Ho{m). 

^N.B. For ease of notation, p' and p^, defined in Equations (49) and (50) as real-valued 
association measures, may also refer loosely to M^^-valued association measures, defined as 
pHX,y) = {p*{X{;m],y) : m = 1,...,M) and p^iX,y) = (p^ {X {■ , m) , y) : m = 1,...,M) 
for X G RC;x*^ and y S K<^. 
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5.2.6. Multiple testing procedures 

For the purpose of illustration, we focus on control of the family-wise error rate, 
using single-step maxT Procedure 2.9, based on the non-parametric bootstrap null 
shift-transformed test statistics null distribution of Procedure 2.9 (null shift values 
Xo{m) = and no scaling). 

Let On{m) denote indices for the ordered adjusted p- values, so that Poti(0„(1)) < 
• • • < Pon{On{M)). GO terms with adjusted p- values less than or equal to a are 
declared significantly associated with differential gene expression at nominal FWER 
level a. That is, the list of GO terms found to be associated with differential gene 
expression is 

(53) 7^„(a) = {m : Po„(m) < a} = {0„(1), . . . , 0„(i?„(a))} , 

where i?„(a) = |7?,„(a)| denotes the number of identified GO terms. 

5.2.7. Summary of testing scenarios 

This section summarizes our approach for identifying GO terms associated with 
BCR/ABL vs. NEG differential gene expression. For each of the three gene ontolo- 
gies (i.e., BP, CC, and MF), we consider the following three types of testing sce- 
narios, each corresponding to a different association parameter tp = p{A, A) for GO 
annotation and BCR/ABL vs. NEG differential gene expression. Scenarios MT[t,t] 
and MT[(i, are very similar and correspond, respectively, to continuous gene- 
parameter profiles of standardized and unstandardized measures of differential gene 
expression. In contrast, Scenario MT[^, x] corresponds to a binary gene-parameter 
profile of differential gene expression indicators. 

Scenario MT[t,f]: Association parameter V* * = p*(A, |A*|), for standard- 
ized continuous DE gene-parameter profile A*. Consider the two-sided 
test of 

i7*'*(m) ^ l(V^*'*(m) = V'^'*(m)) 

vs. 

ijj'*(m) = I(V'*'*(to)^Vo'*M), 
where the association parameter vector of interest is defined as 

V'*'*^P*(A|A*|), 

based on Equations (44) and (49), and the null values are ■ipQ*{m) = 0. The 
continuous DE gene-parameter profile A* is estimated by A^, as in Equation 
(46), and the association parameter -0*'* is estimated by the resubstitution esti- 
mator t/i^'* = p*{A, |A^|), as in Equation (52). The test statistics arc defined as 
(unstandardized) difference statistics, 

T^-*(m)^ V^(V':,'*(m)-0*'*(m)), 

and the null hypotheses HQ*{m) are rejected for large absolute values of T^*{m). 
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Scenario MT[(i, t]: Association parameter i/''*'* = |A''|), for unstan- 

dardized continuous DE gene-parameter profile A"*. Consider the two- 
sided test of 

i/f(m) EE l(^'^'*(m)=Vo'*M) 

vs. 

Ht\m) EE l(v/-*(m)^<*(m)), 
where the association parameter vector of interest is defined as 

based on Equations (44) and (49), and the null values are ■0o'*(m) = 0. The 
continuous DE gene-parameter profile A'' is estimated by A^, as in Equation 
(46), and the association parameter i/''^'* is estimated by the resubstitution esti- 
mator ijjn'^ = P*(^: I'^nl): ^ Equation (52). The test statistics are defined as 
(unstandardized) difference statistics, 

T,f(m)^^^(^;^'*(m)-<*(m)), 

and the null hypotheses H^'*{m) arc rejected for large absolute values of Tjf'*(m). 
Scenario MT[7^,x]: Association parameter ^jj^-^ = p^{A,X^), for binary 
DE gene-parameter profile A^. Consider the one-sided test of 

Hp^{m) EE I (^ijj^'^^m) < ipp^im)^ 

vs. 

Hf'^{m) EE I (^^^^^{m) > ^pf'^im)^ , 
where the association parameter vector of interest is defined as 

based on Equations (45) and (50), and the null values are ipt'^i^) = 1 
mean of the x^(l)-distribution). The following two types of estimators A^ arc 
considered for the binary DE gene-parameter profile A'^: A^^g,, with numbers of 
DE genes jG = 20, 50, 100 (Equation (47)); A^^^, defined in terms of adjusted 
p-values for FWER-controUing permutation-based single-step maxT Procedure 
2.9 {B = 1,000 permutations of the cancer class labels) and nominal FWER 
level a = 0.05 (Equation (48)). Given an estimator A^ of A'^, the association 
parameter tp^'^ is estimated by the resubstitution estimator tp^'^ = p^{A,X^), 
as in Equation (52). The test statistics are defined as (unstandardized) difference 
statistics, 

and the null hypotheses Hf'^{m) are rejected for large values of T^'^{m). 

For each of the three testing scenarios, the null shift-transformed test statistics 
null distribution Qq is estimated as in Procedure 2, with B = 5, 000 non-parametric 
bootstrap samples of the data and Z^{m, b) = T^{m, h) — E[r^(m, •)] (i.e., 

null shift values Ao(to) = and no scaling). For one-sided testing Scenario MT[7^, x]. 
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bootstrap-based single-step maxT adjusted p- values Pori(™) are computed as in 
Procedure 2. For two-sided testing Scenarios MT[i, t] and MT[(i, t], adjusted p- values 
are computed based on absolute values of {m, b) and Tn{m). 

In what follows, the G-dimensional gene-parameter profiles A correspond to the 
G = 2, 071 genes with unique Entrcz Gene IDs, obtained as described in Section 
5.1. For each of the three gene ontologies, binary gene-annotation matrices are 
assembled for the GO terms annotating at least 10 of the G = 2, 071 genes of 
interest: G = 2, 071 x Mbp — 367 gene-annotation matrix Abp for the BP ontology, 
G = 2, 071 X Mcc ~ 81 gene-annotation matrix Acc for the CC ontology, and 
G = 2, 071 X Mmf = 185 gene-annotation matrix Amf for the MF ontology. 

5. 3. Results 

5.3.1. Differentially expressed genes between BCR/ABL and NEG B-cell ALL 

In order to identify differentially expressed genes between BCR/ABL and NEG 
B-cell ALL, two-sided tests of the G null hypotheses 

Ho{g) = l{tJ-BCR/ABL{g) = iJ-NEoig)) 

are performed using the two-sample <-statistics {g) of Equation (46) and FWER- 
controUing bootstrap-based single-step maxT Procedure 2. Adjusted p- values Pj^ (g) 
are obtained using the MTP function from the multtest package (Version 1.8.0, Bio- 
conductor Release 1.7), with B = 5,000 non-parametric bootstrap samples and 
other arguments set to their default values. 

Figure 4 displays a normal quantile-quantilc plot of the test statistics A*j(g) 
(Panel (a)) and a plot of the sorted bootstrap-based single-step maxT adjusted 
p- values Pfnid) (Panel (b)). A handful of genes stand out in terms of their large 
absolute test statistics and small adjusted p-values. 

For control of the FWER at nominal level a = 0.05, Procedure 2.9 identifies 
16 differentially expressed genes, i.e., 16 genes with Pj^ls) — Table 3 provides 




-2-10 1 2 
Theoretical Quantiles 




Index 



Panel (a): Test statistics 



Panel (b): Adjusted p- values 



Fig 4. Differentially expressed genes between BCR/ABL and NEG B-cell ALL. Panel (a): Normal 
quantile-quantile plot of two-sample t-statistics A*j(g). Panel (b): Plot of sorted bootstrap-based 
single-step maxT adjusted p-values P^'.^ig). 
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Table 3. Differentially expressed genes between BCR/ABL and NEG B-cell ALL. This table 
provides the Affymetrix probe IDs, Entrez Gene IDs (hgu95av2L0CUSID environment in hgu95av2 
package), gene symbols (hgu95av2SYMBQL environment), gene names (hgu95av2GENENAME environ- 
ment), test statistics A^(3) (Equation (16)), and adjusted p-values P^(g), for the 16 genes found 
to be significantly differentially expressed between BCR/ABL and NEG B-cell ALL, at nomi- 
nal FWER level a = 0.05, according to the bootstrap-based single-step maxT procedure, with 
two-sample i-statistics A^(g) and B = 5,000 bootstrap samples. A more detailed hyperlinked 
table, including information on gene function, chromosomal location, links to GenBank, Entrez 
Gene, NCBI Map Viewer, UniGene, PubMed, AmiGO, and KEGG, is provided on the website 
companion (Supplementary Table 10.1). 

Probe ID Entrez Gene ID Symbol A^(g) Ptnio) 

1635_at 25 ABLl WM O.Oc-f-00 
v-abl Abelson murine leukemia viral oncogene homolog 1 

40202_at 687 KLF9 6.33 O.Oc-f-00 
Kruppel-like factor 9 

37027_at 79026 AHNAK 5.71 1.4c-03 
AHNAK nucleoproteln (desmoyokln) 

39837_s_at 168544 ZNF467 5.45 3.4e-03 
zinc finger protein 467 

33774_at 841 CASP8 5.29 4.2e-03 
caspase 8, apoptosis-related cysteine peptidase 

37014_at 4599 MXl -5.23 5.0e-03 
myxovirus (influenza virus) resistance 1, 
interferon-inducible protein p78 (mouse) 



2039_s_at 2534 


FYH 


5.21 


5.0e-03 


FYN oncogene related to SRC, 


FGR, YES 






39329_at 87 


ACTNl 


4.97 


9.6C-03 


actinin, alpha 1 








32542_at 2273 


FHLl 


4.96 


l.Oe-02 


four and a half LIM domains 1 








40051_at 9697 


TRAM2 


4.59 


2.7e-02 


translocation associated membrane protein 2 






38032_at 9900 


SV2A 


4.54 


3.1e-02 


synaptic vesicle glycoprotein 


2A 






39319_at 3937 


LCP2 


4.50 


3.5e-02 



lymphocyte cytosolic protein 2 

(SH2 domain containing leukocyte protein of 76 kDa) 

33232_at 1396 CRIPl 4.46 3.7e-02 

cysteine-rich protein 1 (intestinal) 

36591_at 7277 TUBAl 4.37 4.4e-02 

tubulin, alpha 1 (testis specific) 

38994_at 8835 S0CS2 4.35 4.7e-02 

suppressor of cytokine signaling 2 

40076_at 7165 TPD52L2 -4.33 4.8c-02 

tumor protein D52-like 2 



the test statistics, adjusted p-values, and various identifiers for these 16 genes. A 
more detailed hyperlinked table is posted on the website companion (Supplementary 
Table 10.1; www . stat .berkeley . edu/~sandrine/MTBook/BAM/BAM. html). 

Only 2 of the 16 identified genes have a negative test statistic (MXl and TPD52L2), 
suggesting that most DE genes tend to be over-expressed in cell samples with the 
BCR/ABL fusion. 

Unsurprisingly, the gene showing the most over-expression in BCR/ABL cell 
samples, as measured by the f-statistics A^, is the ABLl gene (v-abl Abelson 
murine leukemia viral oncogene homolog 1). located on the long arm of chro- 
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mosome 9 (9q34.1). As mentioned in Section 5.1, the BCR/ABL phenotype is indeed 
defined in terms of the ABLl gene. 

Furthermore, many of the DE genes seem to be related to apoptosis or oncogene- 
sis. For example, the Kruppel-like factor 9 (KLF9) gene encodes a transcription 
factor that binds GC-box elements in gene promoter regions. The Kriippel-like 
factor (KLF) family is comprised of highly related zinc- finger proteins, that are im- 
portant components of the eukaryotic cellular transcriptional machinery and that 
take part in a wide range of cellular functions (e.g., cell proliferation, apoptosis, dif- 
ferentiation, and neoplastic transformation). In particular, KLFs have been linked 
to various cancers [-'>]. The intron-less gene AHNAK nucleoprotein (desmoyokin) 
(AHNAK), located on the long arm of chromosome 11 (llql2.2), encodes an un- 
usually large protein (~ 700 kiloDalton (kDa)) that is typically repressed in cell 
lines derived from human neuroblastomas and several other types of tumors [ > ]. 
Yet another example, the caspase 8, apoptosis-related cysteine peptidase 
(CASP8) gene, encodes a key enzyme at the top of the apoptotic cascade and has 
been linked to neuroblastoma [ ]. Likewise, other genes listed in Table 3, including 
MXl, FYN, ACTNl, FHLl, and TRAM2, appear to be related to the molecular biology 
of cancer. 

Our results are in general agreement with those of [44], slight differences being 
due, most likely, to our preliminary gene filtering, which involves averaging the 
expression measures of multiple probes mapping to the same Entrez Gene ID. 

For greater detail, the interested reader is invited to consult Supplementary Table 
10.1 on the website companion and follow links to PubMcd and other databases. 
Further exploration of the DE genes may be performed in R using the Bioconductor 
packages annotate and annaffy. 



5.3.2. GO terms associated with differential gene expression between BCR/ABL 
and NEG B-cell ALL 

Figure 5 displays, for each of the three gene ontologies and each of the three testing 
scenarios, plots of the sorted adjusted p-values, -Pon(0„(l)) < • • • < Pon(0„(M)), 
for FWER-controUing bootstrap-based single-step maxT Procedure 2 (i? = 5, 000 
bootstrap samples). The smaller the adjusted p- values, the less conservative the 
procedure, and the longer the list 72,„(a) = {m : PQn{m) < a} of identified GO 
terms at any given nominal Type I error level a. 

Table 4 summarizes the results in terms of the numbers i?,i(a) = |7?.„(a)| of 
GO terms found to be significantly associated with BCR/ABL vs. NEG differential 
gene expression for different nominal FWER levels a. 

In general, adjusted p-values tend to be quite large, with only a handful of GO 
terms identified as being significantly associated with BCR/ABL vs. NEG differ- 
ential gene expression for nominal FWER level a G {0.05, 0.10, 0.20}. The ad- 
justed p- values for Scenarios MT[t,i] and MT[(i, i] (red and blue plotting symbols), 
corresponding, respectively, to standardized and unstandardized continuous DE 
gene-parameter profiles, are similar. For the BP and MF gene ontologies. Scenario 
MT[<, t] seems to be slightly more conservative than Scenario MT[c?, t]; however, this 
does not hold for the CC ontology. Scenario MT[7^, x], with four different estimators 
of the binary DE gene-parameter profile A'^ , tends to be more conservative than ei- 
ther Scenario MT[t, t] or MT[(i, t]. Furthermore, the choice of parameter 7G, for the 
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Fig 5. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, adjusted p-values. Plots of sorted bootstrap-based single-step maxT adjusted p- values 
Ponim), for each of the three gene ontologies and each of the three testing scenarios. 

Table 4. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL. This table reports, for each of the three gene ontologies and each of the three testing 
scenarios, the numbers R„{a) = \TZn{ct)\ of GO terms found to be significantly associated with 
BCR/ABL vs. NEG differential gene expression for different nominal FWER levels a. 

Nominal FWER level, a 
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number of genes called differentially expressed, can have a substantial impact on 
the adjusted p-values for Scenario MT[^, x ■ jG]. There are some indications, espe- 
cially for the CC ontology, that greater values of the parameter 7G lead to greater 
numbers of identified GO terms. Note that for Scenario MT[7^, x], the p- value-based 
estimator A^^,, with a = 0.05, and the naive estimator X^iG' "^ith 7G = 20, yield 
very similar results (green and purple plotting symbols). Indeed, when applied to 
the entire dataset for the n = 79 cell samples, permutation-based single-step maxT 
Procedure 1 identifies 20 genes as being differentially expressed between BCR/ABL 
and NEG B-ccU ALL at nominal FWER level a = 0.05. In other words, A^g Q5 and 

A^ 20 yield the same estimate of the binary gene-parameter profile A'^ for the set of 
DE genes. Minor discrepancies between the results of Scenarios MT[7^, x : a = 0.05] 
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Fig 6. GO terms associated with differential gene expression between BCR/ABL and NEG B-cell 
ALL, common terms between testing scenarios. Plots of numbers of common GO terms among 
sets of ordered GO terms Onir) of various cardinality r for pairs of testing scenarios. Scenario 
MT[t,t] is used as the baseline in the top panels and Scenario MT[^,x : a = 0.05], with adjusted 
p- value-based estimator A^^, a = 0.05, for the binary DE gene-parameter profile A'^, is used 
as the baseline in the bottom panels. For example, the blue curve in the top-left panel is a plot 
of I C'n'*('') n ©^''(r) I vs. r for the MF gene ontology, i.e., of the overlap between the r most 
significant MF GO terms according to Scenarios MT[(i, t] and MT[t,t]. 



and MT[^, x '■ iG = 20] are due to the fact that while the estimators A„ q and 
A^20 coincide on the original sample, they may differ on bootstrap samples of these 
data. 

Next, the three testing scenarios are compared in terms of the contents of the lists 
7?.„(a) of identified GO terms. Specifically, let C'„(r) = {0„(1), . . . ,0„(r)} denote 
the set of indices corresponding to the r smallest adjusted p- values for a given gene 
ontology and testing scenario. Measures of agreement between testing scenarios 
are provided by the numbers of common GO terms among sets of ordered GO 
terms Onlr) of various cardinality r, i.e., by the cardinality of intersections of sets 
On {r) for different testing scenarios. Figure 6 displays plots of numbers of common 
GO terms for pairs of testing scenarios. As expected, there is substantial overlap 
between the GO terms identified by Scenarios MT[t, i] and MT[(i, i] for continuous 
DE gene-parameter profiles (blue plotting symbols in top panels). This suggests 
that, for the ALL dataset, standardized (A*) and unstandardized (A'') continuous 
measures of differential gene expression have similar properties. In contrast, there is 
much less overlap between the GO terms identified by Scenario MT[^, x], for binary 
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Fig 7. GO terms associated with differential gene expression between BCR/ABL and NEG B-cell 
ALL, conditional distribution of given A, Conditional boxplots of the estimated continuous 
DE gene-parameter profile given the gene-annotation profiles A(-, m) for the top two GO terms 
m G {On(l), On(2)} identified according to each of the three testing scenarios. Rows correspond 
to gene ontologies and columns to testing scenarios. In each panel, the white and gray boxplots 
correspond, respectively, to the GO terms with the smallest and second smallest adjusted p- values; 
boxplots for unannotated and annotated estimated gene-parameter profiles, {A^((ji) : A{g,m) = 
0) and {^n{g) '■ A{g,m) = 1), are labeled as and 1, respectively. Non-overlapping notches 
(informally) represent large difi^erences in medians. 



DE gene-parameter profiles, and either Scenario MT[t, t] or MT[(i, t]. For example, 
for the MF gene ontology, among the top 10 GO terms ©^(lO) identified by each 
testing scenario, 6 are common to Scenarios MT[t,t] and MT[(i, i], whereas at most 
3 are common to Scenarios MT[i,i] and MT[7^,x]. Again, note the near perfect 
agreement between Scenarios MT[7^,x : a = 0.05] and MT[^,x : 7G = 20] (purple 
plotting symbols in lower panels). Figure 6 again illustrates the lack of robustness 
of Scenario MT[7^, x ■ iG] to the choice of parameter 7G. 

Moreover, examine graphical summaries of the joint distributions of the esti- 
mated continuous DE gene-parameter profile and the gene- annotation profiles 
A{-,m) for the top two GO terms m e {0„(1), 0„(2)} identified according to each 
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testing scenario. Figure 7 displays conditional boxplots of given A{-,m), that 
is, boxplots of the unannotated and annotated estimated gene-parameter profiles, 
{Xn{g) : A{g,m) = 0) and [\n[g) ■ A{g,m) = 1), respectively. Although the box- 
plots reveal clear differences (non-overlapping notches) between unannotated and 
annotated profiles for some of the GO terms (e.g., MF term GO : 0003735), the differ- 
ences can be subtle for other terms (e.g., MF term 00:0003924). Not surprisingly, 
the most extreme differences are seen for Scenarios MT[t,t] and MT[d, t], and, to a 
lesser extent, for Scenario MT[7^,x : a — 0.05] for the CC ontology. The boxplots 
again illustrate differences between Scenario MT[7^,x] and either Scenario MT[t,t] 
or MT[d,t]. 

Tables 5, 6, and 7 report various p-value-based measures of association between 
the estimated DE gene-parameter profiles and A^^ and the gene-annotation 
profiles A{-,m) for the top two GO terms m S {0„(1), 0„(2)} identified according 
to each testing scenario, in the BP, CC, and MF gene ontologies, respectively. The 
transformation to the [0, 1] p- value scale allows a more direct comparison of the var- 
ious testing scenarios. The tables again highlight the differences between Scenario 
MT[7^,x], for binary DE gene-parameter profiles, and either Scenario MT[t, i] or 
MT[(i, t], for continuous DE gene-parameter profiles. As expected, Scenarios MT[t, t] 
and MT[d, t] tend to identify GO terms with small p-values -Po^*(m.) for t-tests of 
association between estimated continuous gene-parameter profiles A^ and gene- 
annotation profiles A{-,m). In contrast, and also as expected. Scenario MT[^,x] 
tends to identify GO terms with small p- values Pf^^{m) for x^-tests of associa- 
tion between estimated binary gene-parameter profiles A^^, and gene-annotation 
profiles v4(-,77i). Furthermore, the tables corroborate our earlier observation that 

Table 5. GO terms associated with differential gene expression between BCR/ABL and NEG B- 
cell ALL, top two BP GO terms. This table provides association measures between the estimated 
DE gene-parameter profiles and Xn,a and the gene-annotation profiles A(-,m) for the top two 
BP GO terms m £ {On(l), On(2)} identified according to each of the three testing scenarios. 
Ai{m) = ^^j4(g, m): Number of genes directly or indirectly annotated with GO term m (out 

of G = 2,071 genes, GQALLLOCUSID environment in GO package). P|^^'(m): Nominal unadjusted 
p-value for the two-sample i-tcst comparing the unannotated and annotated estimated continuous 
DE gene-parameter profiles, (A'j(g) : A{g, m) = 0) and {X^{g) ■ A{g, m) = 1), respectively (t.test 
function from the R package stats, with default argument values). P^^^{m): Unadjusted p-value 
for the x^-test of independence between the estimated binary DE gene-parameter profile A^^, 
a = 0.05, and the gene-annotation profile A(-, m) (chisq.test function from the R package stats, 
with arguments simulate. p. value = TRUE, correct=FALSE). Po„(m): Bootstrap-based single-step 
maxT adjusted p-valuc, according to which the top two GO terms arc identified for each testing 
scenario. 



BP 



Scenario 






GO term 


Ai{m) 






Pon (m) 


MT[i,t] 






GQ 


: 0008152 


1076 


2.5e-09 


1.7e-01 


2.6e-02 








GQ 


: 0044237 


1045 


3.8e-08 


1.8e-01 


4.3e-02 


MT[d, t] 






GQ 


: 0006091 


98 


5.2e-06 


6.3C-01 


3.7e-02 








GQ 


: 0000226 


14 


1.8e-03 


l.Oe+00 


5.8e-02 




a = 


: 0.05] 


GQ 


: 0008361 


27 


5.5C-02 


3.5C-03 


8.3e-02 








GQ 


: 0016049 


27 


5.5C-02 


1.5C-03 


8.3e-02 


MT[^,X: 


7G 


= 20] 


GQ 


: 0008361 


27 


5.5e-02 


4.0e-03 


2.1e-01 








GQ 


: 0016049 


27 


5.5e-02 


1.5e-03 


2.1e-01 




7G 


= 50] 


GQ 


: 0048522 


87 


3.6e-02 


6.5e-03 


1.9e-01 








GQ 


: 0048518 


96 


4.4e-02 


1.3e-02 


2.3e-01 


MT[^,X: 


7G 


= 100] 


GQ 


: 0050793 


24 


8.5e-02 


1.7e-02 


1.5e-01 








GQ 


: 0007155 


59 


5.9e-04 


1.2e-01 


2.0e-01 
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Table 6. GO terms associated with differential gene expression between BCR/ ABL and NEG 
B-cell ALL, top two CC GO terms. Details in Table 5 caption. 



CC 



Scenario 






GO term 


Ai{m) 






^'on(m) 


UJ[t,t] 






GQ 


: 0005840 


25 


1.3e-08 


l.Oe+OO 


5.6e-03 








GQ 


: 0030529 


77 


3.1e-10 


6.4e-01 


1.4e-02 


MT[d, t] 






GQ 


: 0005840 


25 


1.3e-08 


l.Oe+00 


4.0e-03 








GQ 


: 0005830 


11 


2.8e-05 


l.Oc+OO 


5.2e-03 


MTM,X: 


a = 


: 0.05] 


GQ 


: 0005578 


10 


1.7e-02 


l.Oe-01 


4.9e-01 








GQ 


: 0031012 


10 


1.7e-02 


l.le-01 


4.9e-01 


MT[^,X: 


7G 


= 20] 


GQ 


: 0005578 


10 


1.7e-02 


l.Oe-01 


3.5e-01 








GQ 


: 0031012 


10 


1.7e-02 


9.2e-02 


3.5e-01 




iG 


= 50] 


GQ 


: 0005576 


54 


9.1e-04 


l.Oc+00 


7.8e-03 








GQ 


: 0005615 


31 


4.8e-02 


2.4e-01 


7.8e-03 




7G 


= 100] 


GQ 


: 0005576 


54 


9.1e-04 


l.Oc+00 


4.9C-02 








GQ 


: 0005615 


31 


4.8e-02 


2.6e-01 


l.Sc-Ol 



Table 7. GO terms associated with differential gene expression between BCR/ ABL and NEG 
B-cell ALL, top two MF GO terms. Details in Table 5 caption. 



MF 



Scenario 






GO term 


Ai{m) 






Pon (m) 


UJ[t,t\ 






GQ 


: 0003735 


24 


l.le-09 


l.Oe+00 


2.4e-03 








GQ 


: 0003723 


143 


1.5e-06 


4.1e-01 


1.2e-01 


UT[d, t] 






GQ 


: 0003735 


24 


l.le-09 


l.Oc+00 


2.2e-03 








GQ 


: 0003723 


143 


1.5e-06 


3.8e-01 


7.8e-02 


MT[^,X: 


a = 


: 0.05] 


GQ 


: 0004930 


10 


2.2e-01 


3.0e-03 


3.7e-02 








GQ 


: 0003924 


34 


6.5e-01 


3.9e-02 


7.0e-01 


MT[^,X: 


7G 


= 20] 


GQ 


: 0004930 


10 


2.2e-01 


3.0e-03 


1.7e-02 








GQ 


: 0003924 


34 


6.5e-01 


3.8e-02 


6.2e-01 


MT[7^,X: 


iG 


= 50] 


GQ 


: 0004930 


10 


2.2e-01 


3.5e-03 


4.1e-01 








GQ 


: 0030246 


22 


8.6e-01 


2.0e-01 


4.8e-01 


MT[^,X: 


7G 


= 100] 


GQ 


: 0005509 


69 


3.8e-04 


1.3e-01 


3.1e-01 








GQ 


: 0004930 


10 


2.2e-01 


1.5e-03 


3.3e-01 



Scenario MT[7^,x] tends to be more conservative than cither Scenario MT[t,t] or 
MT[(i, t]. Indeed, some of the GO terms with smah p- values Po^*(m) for continuous 
gene-parameter profiles have very large p- values Pj^''^(m) for binary gene-parameter 
profiles (e.g., MF term GO : 0003735 in Table 7). Such terms are likely to be identified 
by Scenarios MT[t, i] and MT[rf, i], but missed by Scenario MT[7^,x]. The converse 
phenomenon is not as striking. However, one should keep in mind that Scenario 
MT[7^, x] depends on the choice of estimator for the binary DE gene-parameter 
profile A^, i.e., on parameters such as a and jG. In particular, with certain values 
of a (or 7G), binary Scenario MT[7^,x] may become more similar to either con- 
tinuous Scenario MT[t, i] or MT[(i, t]. Column Ai{m) in Tables 5-7 suggests that, 
compared to Scenario MT[7^,x], Scenarios MT[t,t] and MT[c?, tend to identify 
GO terms annotating a greater number of genes (this observation also holds for the 
top 20 terms identified according to each testing scenario; data not shown). 

Figure 8 displays a scatterplot matrix of the 50 smallest adjusted p- values, based 
on Scenario MT[t, t], for each of the three gene ontologies. The plots indicate that 
more terms tend to be identified in the BP ontology compared to either the CC or 
MF ontologies, and fewer terms tend to be identified in the MF ontology compared 
to either the BP or CC ontologies. Note that comparisons based on adjusted p- values 
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Fig 8. GO terms associated with differential gene expression between BCR/ABL and NEG B-cell 
ALL, comparison of adjusted p-values for the three gene ontologies. Scatterplot matrix of the 50 
smallest adjusted p-values for each of the three gene ontologies, based on Scenario MT[t,t]. The 
identity line is drawn for reference. 



take into account differences in the numbers of tested hypotheses, Mbp ~ 367, 
Mcc = 81, and Mmf = 185, for each ontology. 

Tables 8, 9, and 10 list the 20 GO terms with the smallest adjusted p-values 
for Scenario MT[t, i], applied to the BP, CC, and MF gene ontologies, respec- 
tively. Figures 9, 10, and 11 display portions of the directed acycHc graphs for 
the top 20 GO terms in each ontology. The figures suggest that GO terms asso- 
ciated with BCR/ABL vs. NEG differential gene expression tend to concentrate 
in certain branches of the DAGs, i.e., differential expression is associated with re- 
lated properties of gene products. Although it is known that many of the effects of 
the BCR/ABL fusion are mediated by tyrosine kinase activity, the MF GO term 
protein-tyrosinc kinase activity (GO: 0004713) does not appear to be significantly 
associated with differential gene expression between BCR/ABL and NEG B-cell 
ALL (adjusted p-value of 0.8890 for Scenario MT[t,t]). 

For illustration purposes, we further investigate two of the GO terms from Ta- 
bles 8 and 10: GO term anti-apoptosis (GO : 0006916), with ninth smallest adjusted 
p- value for Scenario MT[t,t] appHed to the BP gene ontology, and GO term struc- 
tural constituent of ribosome (GO : 0003735), with the smallest adjusted p- value for 
Scenario MT[t, i] applied to the MF gene ontology. Tables 11 and 12 list genes 
directly or indirectly annotated with GO terms GO: 0006916 and GO: 0003735, re- 
spectively. Figure 12 displays mean-difference plots of the average expression mea- 
sures in BCR/ABL and NEG cell samples for genes annotated with GO terms 
GO : 0006916 and GO : 0003735. 
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Table 8. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, top 20 BP GO terms. This table lists the 20 GO terms with the smallest adjusted 
p-values for Scenario MT[t, t] applied to the BP gene ontology. Ai(m) = ^(Si"^)- Number of 
genes directly or indirectly annotated with GO term m (out of G = 2, 071 genes, GOALLLQCUSID 
environment in GO package). Pon{m): Bootstrap-based single-step maxT adjusted p-valuo for 
Scenario MT[t,t]. 



BP, Scenario MT[t,t] 



GO term ID 


GO term 


Ai(m) 


Pon(m) 


GO: 


: 008152 


metaboJism 


1076 


2.6e-02 


GO: 


: 044237 


cellular metabolism 


1045 


4.3O-02 


GO: 


: 009058 


biosynthesis 


187 


7.5C-02 


GO: 


: 044238 


primary metabolism 


1002 


7.5C-02 


GO: 


: 044249 


cellular biosynthesis 


169 


8.60-02 


GO: 


: 006091 


generation of precursor metabolites and energy 


98 


9.3O-02 


GO: 


: 019882 


antigen presentation 


15 


l.lc-01 


GO: 


: 030333 


antigen processing 


14 


1.4O-01 


GO: 


: 006916 


anti-apoptosis 


21 


1.6C-01 


GO: 


: 043066 


negative regulation of apoptosis 


26 


1.7C-01 


GO: 


: 043069 


negative regulation of programmed cell death 


26 


1.7C-01 


GO: 


: 007154 


cell communication 


390 


1.80-01 


GO: 


: 006457 


protein folding 


52 


1.9O-01 


GO: 


: 007165 


signal transduction 


351 


1.9O-01 


GO: 


: 000226 


microtubule cytoskclcton organization and biogenesis 


14 


2.3O-01 


GO: 


: 006082 


organic acid metabolism 


65 


2.5O-01 


GO: 


: 006163 


purine nucleotide metabolism 


29 


2.8e-01 


GO: 


: 007155 


cell adhesion 


59 


2.8e-01 


GO: 


: 007028 


cytoplasm organization and biogenesis 


10 


3.O0-OI 


GO: 


: 019752 


carboxylic acid metabolism 


63 


3.I0-OI 



Table 9. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, top 20 GC GO terms. Details in Table 8 caption. 



CC, Scenario MT[t,t] 



GO term ID 


GO term 


Ai{m) 


Pon(m) 


GO: 


: 0005840 


ribosomc 


25 


5.60-03 


GO: 


: 0030529 


ribonuclcoprotcin complex 


77 


1.4e-02 


GO: 


: 0005830 


cytosolic ribosome (sensu Eukaryota) 


11 


1.4e-02 


GO: 


: 0043234 


protein complex 


334 


7.8e-02 


GO: 


: 0005886 


plasma membrane 


200 


1.3O-01 


GO: 


: 0005829 


cytosol 


78 


2.20-01 


GO: 


: 0005737 


cytoplasm 


578 


2.3O-01 


GO: 


: 0005887 


integral to plasma membrane 


125 


2.3O-01 


GO: 


: 0031226 


intrinsic to plasma membrane 


125 


2.3O-01 


GO: 


: 0019866 


inner membrane 


37 


2.60-01 


GO: 


: 0005743 


mitochondrial inner membrane 


28 


2.60-01 


GO: 


: 0005746 


mitochondrial electron transport chain 


11 


2.7O-01 


GO: 


: 0000502 


proteasomc complex (sensu Eukaryota) 


26 


2.7O-01 


GO: 


: 0000323 


lytic vacuole 


28 


2.9O-01 


GO: 


: 0005764 


lysosome 


28 


2.9O-01 


GO: 


: 0005576 


extracellular region 


54 


3.I0-OI 


GO: 


: 0005773 


vacuole 


29 


3.20-01 


GO: 


: 0005622 


intracellular 


1152 


3.4e-01 


GO: 


: 0043228 


non-membrane-bound organelle 


218 


3.5O-01 


GO: 


: 0043232 


intracellular non-membrane-bound organelle 


218 


3.5O-01 
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Table 10. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, top 20 MF GO terms. Details in Table 8 caption. 



MF, Scenario MT[t,t] 



GO term ID 


do term 






GO : 


: UUuo 1 6o 


structural constituent of ribosome 


24 


2.4e-03 


GO: 


: 0003723 


RNA binding 


143 


1.2e-01 


GO : 


: 0048037 


cofactor binding 


11 


1.5e-01 


GO : 


: OUolOoz 


unfolded protein binding 


47 


2.2e-01 


GO : 


: UUloooo 


isomerase activity 


28 


2.3e-01 


GO : 


: 0016491 


oxidoreductase activity 


89 


3.5e-01 


GO : 


: 0005509 


calcium ion binding 


69 


3.5e-01 


GO : 


: 0015399 


primary active transporter activity 


57 


4.3e-01 


GO : 




receptor activity 


101 


4.5e-01 


GO: 


: 0004871 


signal transducer activity 


242 


4.6e-01 


GO: 


: 0016765 


transferase activity, transferring alkyl or aryl (other than 
methyl) groups 


10 


4.6e-01 


GO: 


: 0016860 


intramolecular oxidoreductase activity 


13 


4.6e-01 


GO: 


: 0016614 


oxidoreductase activity, acting on CH-OH group of 
donors 


18 


4.7e-01 


GO: 


: 0016616 


oxidoreductase activity, acting on the CH-OH group of 
donors, NAD or NADP as acceptor 


18 


4.7e-01 


GO: 


: 0043169 


cation binding 


230 


5.0e-01 


GO: 


: 0005489 


electron transporter activity 


47 


5.4e-01 


GO: 


: 0005386 


carrier activity 


73 


5.5e-01 


GO: 


: 0004888 


transmembrane receptor activity 


59 


5.7e-01 


GO: 


: 0003824 


catalytic activity 


635 


5.8e-01 


GO: 


: 0003676 


nucleic acid binding 


449 


6.7e-01 



Panel (a) in Figure 12 indicates that genes annotated with BP GO term anti- 
apoptosis (GO : 0006916) tend to be over-expressed in BCR/ABL compared to NEG 
cell samples. Among these 21 genes, only S0CS2 is significantly differentially ex- 
pressed between BCR/ABL and NEG B-cell ALL (nominal EWER level a = 0.05, 
Table 3). However, a brief survey of the literature reveals that several of the genes in 
Table 11 interact with the BCR/ABL proto-oncogene. Eor instance, [27] investigate 
mechanisms for the BCR/ABL-mediated activation of the transcription factor NE- 
kB/RcI encoded by the NFKBl gene. Their findings suggest that NE-kB/RcI may 
be a potential target for molecular therapies of leukemia. [30] demonstrate that 
ectopic expression of BCR/ABL interferes with the tumor necrosis factor (TNF) 
signaling pathway through the down-regulation of TNE receptors. The TNF gene 
encodes a multifunctional proinflammatory cytokine involved in the regulation of 
a wide spectrum of biological processes, including cell proliferation, differentiation, 
apoptosis, lipid metabolism, and coagulation. The TNF gene has been implicated in 
a variety of diseases, including autoimmune diseases, insulin resistance, and can- 
cer. 

As seen in Table 12, 22 of the 24 genes annotated with MF GO term structural 
constituent of ribosome (GO: 0003735) code for ribosomal proteins. Although none 
of the 24 annotated genes is identified as being significantly differentially expressed 
between BCR/ABL and NEG B-cell ALL (nominal EWER level a = 0.05, Table 
3), Panel (b) in Figure 12 suggests that these genes tend to be under-expressed in 
BCR/ABL cell samples. 
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Fig 9. GO terms associated with differential gene expression between BCR/ABL and NEG B-cell 
ALL, DAG for top 20 BP GO terms. Portion of the directed acyclic graph for the 20 GO terms 
with the smallest adjusted p-values for Scenario MT[t, t] applied to the BP gene ontology. Nodes 
for the top 20 terms are shaded in lavender; black and red edges indicate, respectively, "is a" 
and "part of a" relationships among terms. The figure was produced using the QuickGO browser. 
According to QuickGO, the GO term IDs GD: 019882 and GO: 0030333 listed in Table 8 correspond 
to the same term, antigen processing and presentation. (Higher-resolution color version on website 
companion.) 



6. Discussion 

We have proposed a general and formal statistical framework for multiple tests of 
association with biological annotation metadata. A key component of our approach 
is the systematic and precise translation of a generic biological question into a cor- 
responding multiple hypothesis testing problem, concerning association measures 
between known gene- annotation profiles and unknown gene-parameter profiles. This 
general and rigorous formulation of the statistical inference question allows us to 
apply the multiple hypothesis testing methodology developed in [14] and related 
articles, to control a broad class of Type I error rates, in testing problems involving 
general data generating distributions (with arbitrary dependence structures among 
variables), null hypotheses, and test statistics. 

The flexibility of our approach was illustrated using the ALL microarray dataset 
of [13], with the aim of relating GO annotation to differential gene expression be- 
tween BCR/ABL and NEG B-cell ALL. This analysis demonstrates the impor- 
tance of selecting a suitable DE gene-parameter profile A and measure p for the 
association between this gene-parameter profile and GO gene-annotation profiles 
A. Indeed, for the ALL dataset, the choice of gene-parameter profile for measur- 
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Fig 10. GO terms associated with differential gene expression between BCR/ABL and NEG B- 
cell ALL, DAG for top 20 CC GO terms. Portion of the directed acyclic graph for the 20 GO 
terms with the smallest adjusted p-values for Scenario MT[t, t] applied to the CC gene ontology. 
Nodes for the top 20 terms are shaded in lavender; black and red edges indicate, respectively, 
"is a" and "part of a" relationships among terms. The figure was produced using the QuickGO 
browser. (Higher-resolution color version on website companion.) 



ing differential expression between BCR/ABL and NEG B-cell ALL has a large 
impaet on the list of identified GO terms. Testing seenarios based on binary DE 
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Fig 11. GO terms associated with differential gene expression between BCR/ABL and NEG B- 
cell ALL, DAG for top 20 MF GO terms. Portion of the directed acyclic graph for the 20 GO 
terms with the smallest adjusted p-values for Scenario MT[i,t] applied to the MF gene ontology. 
Nodes for the top 20 terms are shaded in lavender; black and red edges indicate, respectively, 
"is a" and "part of a" relationships among terms. The figure was produced using the QuickGO 
browser. (Higher-resolution color version on website companion.) 



7.5 8.0 8.5 9.0 

(fiBCR/ABL.n + ^NEG,n)/2 




10 11 12 

((^BCR/ABL.n + fANEG,n)/2 



Panel (a): BP G0:0006916 



Panel (b): MF GO: 0003735 



Fig 12. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, BP GO term 00:0006916 and MF GO term 00:0003735. This figure displays mean- 
difference plots of average expression measures in BCR/ABL and NEG cell samples, i.e., plots of 
lJ^BCR/ABL,^(g)- IJ'NEG.riig) VS. (/ifiCfl/Aisi, .,i (s) + Mneg.ti (9))/2, for gcucs directly or indirectly 
annotated with GO terms GD: 0006916 (Panel (a)) and 00:0003735 (Panel (b)). The term anti- 
apoptosis (GD : 0006916) has the ninth smallest adjusted p- value for Scenario MT[4 , t] applied to the 
BP gene ontology (Tables 8 and 11) and the term structural constituent ofribosome (00:0003735) 
has the smallest adjusted p-value for Scenario MT[t,t] applied to the MF gene ontology (Tables 
10 and 12). 



gene-parameter profiles (Scenario MT[^, x]) tended to be more conservative than 
scenarios based on continuous DE gene-parameter profiles (Scenarios MT[i,t] and 
MT[(i, t]), with little overlap between the lists of identified GO terms. Furthermore, 
testing scenarios based on binary gene-parameter profiles were sensitive to the some- 
what arbitrary DE/non-DE gene dichotomization, that is, Scenario MT[7^,x : 7G] 
lacked robustness with respect to the choice of parameter 7G for the number of 
genes called differentially expressed according to the estimator X^^q. In contrast, 
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Table 11. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, BP GO term GO : 0006916. This table lists genes directly or indirectly annotated with 
GO term anti-apoptosis (out of G = 2,071 genes, GOALLLQCUSID environment in GO package). The 
term anti-apoptosis (GO: 0006916) has the ninth smallest adjusted p-value for Scenario MT[t,t] 
applied to the BP gene ontology (Table 8). 



BP GO: 0006916 



Probe ID 


Symbol 


Name 


1237_at 


IER3 


immediate early response 3 


1295_at 


RELA 


v-rel reticuloendotheliosis viral oncogene homolog A, 






nuclear factor of kappa light polypeptide gene enhancer 






in B-cells 3, p65 (avian) 


1377_at 


NFKBl 


nuclear factor of kappa light polypeptide gene enhancer 






in B-cells 1 (pl05) 


1564_at 


AKTl 


v-akt murine thymoma viral oncogene homolog 1 


1830_s_at 


TGFBl 


transforming growth factor, beta 1 






(Camurati-Engelmann disease) 


1852_at 


TNF 


tumor necrosis factor (TNF superfamily, member 2) 


1997_s_at 


BAX 


BCL2-associated X protein 


277_at 


MCLl 


myeloid cell leukemia sequence 1 (BCL2-related) 


31536_at 


RTN4 


reticulon 4 


32060_at 


BNIP2 


BCL2/adenovirus ElB 19 kDa interacting protein 2 


33284_at 


MPO 


myeloperoxidase 


36578_at 


BIRC2 


baculoviral lAP repeat-containing 2 


38578_at 


TNFRSF7 


tumor necrosis factor receptor superfamily, member 7 


38771_at 


HDACl 


histone deacetylase 1 


38994_at 


S0CS2 


suppressor of cytokine signaling 2 


39097_at 


SDN 


SON DNA binding protein 


39378_at 


BECNl 


beclin 1 (coiled-coil , myosin-like BCL2 interacting protein) 


39436_at 


BNIP3L 


BCL2/adenovirus ElB 19 kDa interacting protein 3-like 


40570_at 


FOXDIA 


forkhead box OlA (rhabdomyosarcoma) 


595_at 


TNFAIP3 


tumor necrosis factor, alpha-induced protein 3 


641_at 


PSENl 


presenilin 1 (Alzheimer disease 3) 



continuous gene-parameter profiles based on standardized and unstandardized mea- 
sures of differential gene expression lead to very similar results (Scenarios MT[i,t] 
and MT[d,t\). 

Our analysis of the ALL microarray datasct clearly shows the limitations of bi- 
nary gene-paramctcr profiles of differential expression indicators, which arc still 
the norm for combined GO annotation and microarray data analyses. Our pro- 
posed statistical framework, with general definitions for the gene-annotation and 
gene-parameter profiles, allows consideration of a much broader class of inference 
problems, that extend beyond GO annotation and microarray data analysis. Gene- 
annotation profiles may be continuous or polychotomous and may correspond, 
for example, to exon/intron counts/lengths/nucleotide distributions, gene pathway 
membership, or gene regulation by particular transcription factors. Likewise, gene- 
parameter profiles may be continuous or polychotomous and may correspond, for 
example, to regression coefficients relating possibly censored biological and clinical 
outcomes to genome-wide transcript levels, DNA copy numbers, and other covari- 
ates. 

This first application of our proposed methodology only considered control of 
the family- wise error rate using single-step common-cut-off maxT Procedure 1, 
based on the non-parametric bootstrap null shift and scale-transformed test statis- 
tics null distribution of Procedure 2. Adjusted p-values tended to be quite large, 
with only a handful of GO terms identified as being significantly associated with 
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Table 12. GO terms associated with differential gene expression between BCR/ABL and NEG 
B-cell ALL, MF GO term GO: 0003735. This table lists genes directly or indirectly annotated with 
GO term structural constituent of ribosome (out of G = 2, 071 genes, GOALLLQCUSID environment in 
GO package). The term structural constituent of ribosome (00:0003735) has the smallest adjusted 
p-value for Scenario MT[t, t] applied to the MF gene ontology (Table 10). 



MF GD: 0003735 



Probe ID 


Symbol 


Name 






2016_s_at 


RPLIO 


ribosomal 


protein 


LIO 


31511_at 


RPS9 


ribosomal 


protein 


S9 


31546_at 


RPL18 


ribosomal 


protein 


L18 


31955_at 


FAU 


Finkel-Biskis-Reilly murine sarcoma virus (FBR-MuSV) 
ubiquitously expressed (fox derived) 


32221_at 


MRPS18B 


mitochondrial ribosomal protein S18B 


32315_at 


RPS24 


ribosomal 


protein 


S24 


32394_s_at 


RPL23 


ribosomal 


protein 


L23 


32433_at 


RPL15 


ribosomal 


protein 


L15 


32437_at 


RPS5 


ribosomal 


protein 


S5 


33117_r_at 


RPS12 


ribosomal 


protein 


S12 


33485_at 


RPL4 


ribosomal 


protein 


L4 


33614_at 


RPL18A 


ribosomal 


protein 


L18a 


33661_at 


RPL5 


ribosomal 


protein 


L5 


33668_at 


RPL12 


ribosomal 


protein 


L12 


33674_at 


RPL29 


ribosomal 


protein 


L29 


34316_at 


RPS15A 


ribosomal 


protein 


S15a 


36358_at 


RPL9 


ribosomal 


protein 


L9 


36572_r_at 


ARL6IP 


ADP-ribosylation factor-like 6 interacting protein 


36786_at 


RPLIOA 


ribosomal 


protein 


LlOa 


39856_at 


RPL36AL 


ribosomal 


protein 


L36a-like 


39916_r_at 


RPS15 


ribosomal 


protein 


S15 


41152_f_at 


RPL36A 


ribosomal 


protein 


L36a 


41214_at 


RPS4Y1 


ribosomal 


protein 


S4, Y-linked 1 


41746_at 


NHP2L1 


NHP2 non-histone chromosome protein 2-like 1 






(S. cerevisiae) 





BCR/ABL vs. NEG differential gene expression. Joint augmentation and empirical 
Bayes procedures could be used for control of a broader and more biologically rele- 
vant class of Type I error rates, defined as generalized tail probabilities, gTP{q, g) = 
Pi{g{Vn, Rn) > q), and generafized expected values, gEV{g) = F,[g{Vn, Rn)], for 
arbitrary functions g{Vn,Rn) of the numbers of false positives Vn and rejected 
hypotheses i?„ (Chapters 6 and 7 in [14], [15, 39, 40]). Error rates based on the 
proportion Vn/Rn of false positives (e.g., TPPFP and FDR) are especially appeal- 
ing for large-scale testing problems, compared to error rates based on the number 
Vn of false positives (e.g., gFWER), as they do not increase exponentially with 
the number M of tested hypotheses. More powerful analyses may also be achieved 
with the new null quantilc-transformed test statistics null distribution of [42] . The 
multiple testing methodology developed in [14] and related articles is particularly 
well-suited to handle the variety of parameters of interest and the complex and 
unknown dependence structures among test statistics (e.g., implied by the DAG 
structure of GO terms) that are likely to be encountered in high-dimensional infer- 
ence problems in biomedical and genomic research. 

Note that for asymptotic results, such as consistency or asymptotic linearity, 
the sample size n refers to the number of observational units sampled from the 
population of interest to estimate the gene-parameter profiles, e.g., the number of 
patients in a cancer microarray study. While the sample size n is typically much 
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smaller than the dimension J of the data structure X, sample sizes have consider- 
ably increased in recent genomic applications. In addition, simulation studies have 
indicated that our proposed MTPs have good finite sample properties in terms of 
both Type I error control and power. 

Ongoing efforts include consideration of more general and biologically pertinent 
multivariate association measures p. For instance, for GO annotation metadata, the 
association parameter for a given GO term could take into account the structure of 
the DAG by considering the gene-annotation profiles of offspring or ancestor terms. 
We are also interested in developing better numerical and graphical approaches 
for representing and interpreting the multiple testing results, e.g., the lists of GO 
terms and associated adjusted p-values. Finally, we arc planning on implementing 
the proposed methods in an R package to be released as part of the Bioconductor 
Project. 



Software and website companion 

The multiple testing procedures proposed in [14] and related articles [8, 1-5, 16, 31, 
32, 33, 34, 39, 40, 41, 42] are implemented in the R package multtest, released as 
part of the Bioconductor Project, an open-source software project for the analysis 
of biomedical and genomic data ([14, Section 13.1]; [32]; www.biocond.uctor.org). 

The experimental data (ALL) and annotation metadata (annaffy, annotate, GO, 
hgu95av2) packages used in the analysis of Section 5 may also be obtained from the 
Bioconductor Project website. 

The website companion to [14] provides additional tables, figures, code, and 
references: www . stat .berkeley.edu/~sandrine/MTBook. 
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